
Generating accurate step-by-step reasoning is essential for Large Language Models (LLMs) to address complex problems and enhance robustness and interpretability. Despite the surge of research on advanced reasoning approaches, systematically analyzing how diverse LLMs and reasoning strategies generate reasoning chains remains a significant challenge.

The difficulties stem from the lack of two key elements: (1) an automatic method for evaluating the generated reasoning chains on different tasks, and (2) a unified formalism and implementation of the diverse reasoning approaches for systematic comparison.


This work aims to close the gap:

  • We introduce AutoRace for fully automated reasoning chain evaluation. Existing metrics rely on expensive human annotations or pre-defined LLM prompts not adaptable to different tasks. In contrast, AutoRace automatically creates detailed evaluation criteria tailored for each task, and uses GPT-4 for accurate evaluation following the criteria.
  • We develop LLM Reasoners, a library for standardized modular implementation of existing and new reasoning algorithms, under a unified formulation of the search, reward, and world model components.
  • We conduct an extensive empirical study of different reasoning approaches (e.g., CoT, ToT, RAP). The analysis reveals interesting findings about the factors that contribute to reasoning, including reward guidance, breadth vs. depth in search, the world model, and prompt formats. More results can be found in the AutoRace Leaderboard.

AutoRace: Automated Reasoning Chain Evaluation

AutoRace is a fully automated approach for evaluating reasoning chains that adapts to different tasks without human effort. For each reasoning task (e.g., math reasoning), AutoRace autonomously constructs a detailed list of evaluation criteria. As shown in the figure below, the LLM detects the errors in automatically collected incorrect reasoning chains and summarizes them into a criteria list (more details can be found in the paper). The criteria list is then used to instruct GPT-4 to evaluate any given reasoning chain on the task.

Figure 2

  • Compared to fixed human-written prompts (Tyen et al., 2023; He et al., 2023), the AutoRace criteria lists are automatically customized for each task with GPT-4 to ensure accurate evaluation.
  • Compared to fine-tuned models (Golovneva et al., 2022; Xia et al., 2024), AutoRace effectively leverages GPT-4's strong prior knowledge, so that it can learn from only incorrect reasoning chains, which can be collected automatically, instead of human labels of reasoning chains.

  • On a wide range of tasks, AutoRace shows strong correlation with human evaluation, and manages to detect 70.4% of incorrect reasoning chains that cannot be captured by the conventional final-answer-based evaluation.
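
The criteria-building and evaluation steps described above can be sketched roughly as follows. This is a minimal illustration, not the AutoRace implementation: the function names, the prompt wording, and the `stub_llm` stand-in are all assumptions made here, and a real setup would call GPT-4 in place of the stub.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def build_criteria(llm: LLM, task: str, incorrect_chains: List[str]) -> str:
    """Detect the error in each incorrect chain, then summarize into criteria."""
    errors = [
        llm(f"Task: {task}\nThis chain reaches a wrong answer:\n{chain}\nExplain the error.")
        for chain in incorrect_chains
    ]
    joined = "\n".join(f"- {e}" for e in errors)
    return llm(f"Task: {task}\nSummarize these errors into evaluation criteria:\n{joined}")


def evaluate_chain(llm: LLM, criteria: str, chain: str) -> bool:
    """Ask the evaluator to judge a chain against the criteria list."""
    verdict = llm(
        f"Criteria:\n{criteria}\nReasoning chain:\n{chain}\nAnswer CORRECT or INCORRECT."
    )
    return verdict.strip().upper().startswith("CORRECT")


def stub_llm(prompt: str) -> str:
    """Toy stand-in for an LLM so the sketch runs offline (assumption)."""
    if "Explain the error" in prompt:
        return "The chain makes an arithmetic slip."
    if "Summarize" in prompt:
        return "1. Check every arithmetic step."
    return "INCORRECT" if "2 + 2 = 5" in prompt else "CORRECT"


criteria = build_criteria(stub_llm, "math reasoning", ["2 + 2 = 5, so the answer is 5"])
print(criteria)
print(evaluate_chain(stub_llm, criteria, "2 + 2 = 4, so the answer is 4"))
```

Only incorrect chains are needed to build the criteria, which is why they can be collected automatically by comparing final answers against the reference answers.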

Try AutoRace in OpenAI GPTs.

LLM Reasoners: A Unified Formulation and Library

There has been rich research on constructing reasoning chains to solve problems using LLMs, from the simplest CoT prompting (Wei et al., 2022) to tree search algorithms guided by a reward function (Yao et al., 2023; Xie et al., 2023; Hao et al., 2023). Hao et al. (2023) proposed incorporating a world model into reasoning, which simulates the state of the world. This enables LLMs to reason in a manner close to humans' conscious planning. These methods, among many others, can be formulated as a search process that maximizes the accumulated reward $\arg\max_{(a_0, \ldots, a_T)} \sum_{t=0}^{T} r(s_t, a_t)$, with a world model that predicts the state transition $\mathcal{T}(s_t \mid s_{t-1}, a_{t-1})$, and a search algorithm that optimizes the objective. In our paper, we discuss how recent reasoning algorithms can be interpreted as special cases within this framework, with specific choices of these three components.
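
To make the three components concrete, here is a small self-contained sketch on a toy problem. Everything in it is an assumption chosen for illustration: the integer state, the two actions, the distance-based reward, and the brute-force search stand in for an LLM-based world model, an LLM-based reward, and algorithms such as MCTS.

```python
from itertools import product
from typing import Sequence, Tuple

ACTIONS = ("+1", "*2")  # toy action space (assumption)
TARGET = 10             # toy goal state (assumption)


def transition(state: int, action: str) -> int:
    """World model: deterministically predicts the next state."""
    return state + 1 if action == "+1" else state * 2


def reward(state: int, action: str) -> int:
    """Reward r(s_t, a_t): higher when the resulting state is closer to TARGET."""
    return -abs(TARGET - transition(state, action))


def rollout(s0: int, actions: Sequence[str]) -> Tuple[int, int]:
    """Accumulate reward along an action sequence; return (total reward, final state)."""
    state, total = s0, 0
    for a in actions:
        total += reward(state, a)
        state = transition(state, a)
    return total, state


def exhaustive_search(s0: int, horizon: int) -> Tuple[str, ...]:
    """Search algorithm: argmax over all action sequences of the given length."""
    return max(product(ACTIONS, repeat=horizon), key=lambda seq: rollout(s0, seq)[0])


best = exhaustive_search(s0=1, horizon=4)
print(best, rollout(1, best))
```

Exhaustive enumeration is only feasible for tiny action spaces. Swapping `exhaustive_search` for a beam or tree search changes the search component while leaving the world model and reward untouched, which is the modularity the formulation is meant to expose.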

LLM Reasoners implements our unified formulation for multi-step reasoning with a modular design. Users can easily set up a reasoning method by defining a WorldModel and a SearchConfig, choosing a SearchAlgorithm, and combining the three in a Reasoner object.

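from reasoners import SearchConfig, WorldModel, Reasoner
from reasoners.algorithm import MCTS
from reasoners.lm import Llama2Model

# Illustrative only: self.llm and the prompt templates are assumed to be set up elsewhere.
class MyWorldModel(WorldModel):
    def step(self, state, action):
        # Ask the LLM to predict the next state after taking `action` in `state`.
        return self.llm.generate(self.next_state_prompt.format(state, action))

class MyConfig(SearchConfig):
    def reward(self, state, action):
        # Ask the LLM to score taking `action` in `state`.
        return self.llm.generate(self.eval_prompt.format(state, action))

# Combine the world model, search configuration, and search algorithm.
reasoner = Reasoner(
    world_model=MyWorldModel(), search_config=MyConfig(), search_algo=MCTS()
)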

Our library provides multiple helper classes to facilitate the development and deployment of reasoning algorithms, e.g., rich LLM APIs, multiple search algorithms, and standard interfaces to popular datasets. In particular, it features a visualization tool to help users understand the reasoning process. Even for the most complex reasoning algorithms, such as Monte-Carlo Tree Search, users can diagnose and understand what happens with one line of Python code. Here we show some examples from GSM8K and Blocksworld.

My goal is to have that the orange block is on top of the blue block. The initial state is that, the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table. What's the action plan to achieve my goal?


Analysis of LLM Step-by-step Reasoning

To better understand multi-step reasoning algorithms and analyze the design elements critical to better reasoning performance, we evaluate them on diverse reasoning datasets, utilizing our AutoRace metric and LLM Reasoners library.
First, we compare several representative reasoning algorithms, including CoT (Wei et al., 2022), ToT (Yao et al., 2023), and RAP (Hao et al., 2023), on a set of reasoning tasks. Tasks labeled with * use the AutoRace evaluation for the main results (final-answer accuracy is shown in brackets for reference). The other tasks are closed-domain, so they are evaluated with oracle reasoning chain evaluators (e.g., a program or simulator).
Figure 3
We have the following observations:
  • Framing reasoning as reward-guided search not only improves final accuracy, but also effectively reduces false-positive reasoning chains.
  • For efficient search in the reasoning space, the breadth of search is generally more important than the depth for most tasks.
  • Incorporating a world model that explicitly infers the reasoning state effectively improves the LLM's reasoning ability, particularly for tasks in embodied environments.
  • Inappropriate prompt format design might inadvertently lead to false-positive reasoning chains.
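
A false-positive reasoning chain, as the term is used above, reaches the correct final answer through incorrect reasoning. The sketch below shows one way to separate final-answer accuracy from chain-level accuracy. The `Result` record, the metric names, and the four sample records are hypothetical placeholders for illustration, not data or definitions from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    answer_correct: bool  # final answer matches the reference
    chain_correct: bool   # chain judged correct (e.g., by AutoRace or an oracle)


def summarize(results: List[Result]) -> Dict[str, float]:
    """Compare final-answer accuracy with chain-level accuracy."""
    n = len(results)
    with_right_answer = [r for r in results if r.answer_correct]
    false_positives = [r for r in with_right_answer if not r.chain_correct]
    return {
        "answer_accuracy": len(with_right_answer) / n,
        # Count a sample only when both the answer and the chain are correct.
        "chain_accuracy": sum(r.answer_correct and r.chain_correct for r in results) / n,
        # Share of right-answer samples whose reasoning is wrong.
        "false_positive_rate": (
            len(false_positives) / len(with_right_answer) if with_right_answer else 0.0
        ),
    }


# Made-up records, for illustration only.
sample = [Result(True, True), Result(True, False), Result(False, False), Result(True, True)]
print(summarize(sample))
```

The gap between the two accuracies is what final-answer-based evaluation alone cannot reveal.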
We also compare across diverse LLMs (GPT-4, Claude-3, Gemini, etc.) on their CoT reasoning chains. We present the results in the figure below, where the LLMs are ranked by their average scores.
Figure 4
  • GPT-4 turbo and Claude-3 Opus show the strongest reasoning abilities, and they lead on almost every reasoning task. We also notice that the ranking of the top-3 models aligns with the ChatArena leaderboard, which suggests that reasoning ability is indeed crucial to powering SOTA chatbots.
  • Top models have achieved remarkable performance on math word problems (GSM8k) and commonsense reasoning (StrategyQA), but reasoning tasks that require strong planning abilities (e.g., Game-24 and Blocksworld) remain unsolved, which leaves room for future research.


    title={LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models},
    author={Hao*, Shibo and Gu*, Yi and Luo*, Haotian and Liu, Tianyang and Shao, Xiyan and Wang, Xinyuan and Xie, Shuhua and Ma, Haodi and Samavedhi, Adithya and Gao, Qiyue and others},
    journal={arXiv preprint arXiv:2404.05221},