Evaluating whether multi-agent debate improves reasoning in large language models
Large language models often struggle with complex reasoning tasks. Recent research suggests that structured debate between multiple agents may improve reasoning quality by exposing incorrect arguments. In this project we implement a multi-agent debate system consisting of two debaters and a judge. The system is evaluated on the StrategyQA dataset. We compare debate reasoning against two baseline methods: Direct QA and Self Consistency.
Our results show that although debate produces structured reasoning, it does not always outperform simpler baselines.
The proposed system implements a multi-agent debate framework designed to improve reasoning quality of large language models. The system consists of three agents: Debater A, Debater B, and a Judge. Each debater produces an initial answer and then participates in a structured debate where agents challenge and respond to each other's reasoning. After the debate rounds, a judge agent analyzes the full transcript and determines the final answer.
The overall architecture follows a four-stage pipeline:
1. Initial answers: each debater produces an opening YES or NO answer with reasoning.
2. Debate rounds: the debaters challenge and respond to each other's arguments, using the transcript so far as context.
3. Judging: a judge agent analyzes the full transcript and selects the stronger argument.
4. Answer extraction: the final YES/NO prediction is parsed from the judge's output and logged.
The system is implemented using a modular Python architecture. The repository is organized into several components responsible for agent logic, debate control, utilities, and experimental evaluation.
LLM_Debate_and_Judge_Pipeline/
  agents/
    debater.py
    judge.py
  debate/
    debate_orchestrator.py
  experiments/
    run_debate.py
    run_direct_qa.py
    run_self_consistency.py
    run_jury.py
  utils/
    answer_utils.py
    config_loader.py
  config/
    config.yaml
  logs/            (experiment results)
This modular design allows easy modification of agents, debate parameters, and evaluation pipelines.
Each debate begins with both debaters generating an initial YES or NO answer. The debate then proceeds through multiple rounds where agents respond to each other's arguments using the previous transcript as context.
To reduce unnecessary debate rounds, an adaptive stopping rule is used. If both agents produce the same answer for two consecutive rounds, the debate terminates early.
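The round loop and the adaptive stopping rule can be sketched as follows. `ask_debater` and `run_debate` are illustrative stand-ins, not the project's actual `debate_orchestrator.py` code; the model call is stubbed so the control flow can be run and inspected.

```python
def ask_debater(name: str, question: str, transcript: list[str]) -> str:
    """Hypothetical LLM call; a real implementation would query the model."""
    return "NO"  # stub answer for illustration

def run_debate(question: str, max_rounds: int = 4) -> list[tuple[str, str]]:
    transcript: list[str] = []
    history: list[tuple[str, str]] = []  # (answer_a, answer_b) per round
    for round_idx in range(max_rounds):
        answer_a = ask_debater("A", question, transcript)
        answer_b = ask_debater("B", question, transcript)
        transcript.append(f"Round {round_idx + 1}: A={answer_a}, B={answer_b}")
        history.append((answer_a, answer_b))
        # Adaptive stopping: terminate early once both agents have produced
        # the same answer in two consecutive rounds.
        if len(history) >= 2 and all(a == b for a, b in history[-2:]):
            break
    return history
```

With the stub above, both agents agree in every round, so the loop stops after the second round rather than running all four.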
The system uses large language models accessed through the OpenAI API. These models were selected because they demonstrate strong natural language reasoning abilities and can generate structured arguments.
Using the same model for both debaters ensures fairness and allows the debate structure itself to influence reasoning outcomes.
All experimental parameters are defined in a configuration file (config/config.yaml). This ensures reproducibility and allows easy adjustment of debate settings.
Key parameters include:
- the model used for all agents (accessed via the OpenAI API)
- the maximum number of debate rounds
- the adaptive early-stopping threshold (consecutive agreeing rounds)
- the sampling temperature and number of samples for Self Consistency
These parameters were selected to balance reasoning quality and computational efficiency.
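A hypothetical sketch of what such a configuration file could look like; every key name and value below is an assumption for illustration, not the project's actual schema.

```yaml
# Illustrative sketch of config/config.yaml (key names are assumptions)
model: gpt-4o-mini          # assumed model identifier
max_debate_rounds: 4
early_stop_agreement: 2     # consecutive agreeing rounds before stopping
temperature: 0.7            # sampling temperature (Self Consistency)
num_samples: 5              # samples per question for the majority vote
dataset:
  name: strategyqa
  num_questions: 100
```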
To evaluate the effectiveness of the debate framework, we conducted experiments on the StrategyQA dataset. StrategyQA is a benchmark dataset that requires multi-step reasoning and implicit knowledge to answer binary (YES/NO) questions.
We selected a subset of 100 questions for evaluation. Each question includes a ground-truth answer used to compute accuracy.
We compare three reasoning methods:
- Debate: two debaters argue opposing answers over multiple rounds, and a judge selects the final answer.
- Direct QA: the model produces a single response for each question.
- Self Consistency: multiple responses are sampled with temperature, and the final answer is chosen by majority vote.

All experiments are implemented in Python and executed using the OpenAI API. The system is modular and consists of separate scripts for each method:

- experiments/run_debate.py
- experiments/run_direct_qa.py
- experiments/run_self_consistency.py
Results are stored as JSON files for reproducibility:
- logs/debate_results.json
- logs/direct_qa_results.json
- logs/self_consistency_results.json

---
We use accuracy as the main evaluation metric. Accuracy is computed as:
Accuracy = (Number of Correct Predictions) / (Total Number of Questions)
Predictions are extracted from model outputs and compared with the ground-truth labels from the dataset.
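A minimal sketch of the extraction-and-scoring step, assuming a simple regex-based parser; `extract_answer` and `accuracy` are illustrative names, not the actual `answer_utils.py` implementation.

```python
import re

def extract_answer(output: str) -> str:
    """Pull a YES/NO prediction out of raw model text; 'unknown' on failure."""
    match = re.search(r"\b(YES|NO)\b", output.upper())
    return match.group(1) if match else "unknown"

def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)

preds = [extract_answer(o) for o in ["Final Answer: YES", "no way", "maybe"]]
# preds -> ['YES', 'NO', 'unknown']
```

Outputs that never state YES or NO fall through to `"unknown"`, which is counted as incorrect; this is the extraction-failure mode discussed in the qualitative analysis.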
---
The debate system is configured using the YAML configuration file described above (config/config.yaml).
Each debate produces a full transcript, which is later evaluated by the judge.
---
For Direct QA, the model generates a single response for each question.
For Self Consistency, multiple responses are generated using temperature sampling, and the final answer is determined by majority vote.
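The sampling-and-vote procedure can be sketched as follows; `sample_answers` is a hypothetical stand-in for the real temperature-sampled API calls and is stubbed here with fixed outputs.

```python
from collections import Counter

def sample_answers(question: str, n: int = 5) -> list[str]:
    """Hypothetical stand-in for n temperature-sampled model calls."""
    return ["YES", "NO", "YES", "YES", "NO"]  # stubbed outputs

def self_consistency(question: str, n: int = 5) -> str:
    """Majority vote over n independently sampled YES/NO answers."""
    answers = sample_answers(question, n)
    return Counter(answers).most_common(1)[0][0]
```

Because the samples are independent, occasional random errors are outvoted, which is the mechanism behind the baseline's robustness.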
---
The following table shows the accuracy of each method under the initial prompts (v1) and the improved prompts (v2):
| Method | Accuracy (v1) | Accuracy (v2) |
|---|---|---|
| Debate | 0.48 | 0.52 |
| Direct QA | 0.74 | 0.71 |
| Self Consistency | 0.74 | 0.71 |
The improved prompt design (v2) increases debate accuracy from 0.48 to 0.52, indicating that prompt engineering has a measurable positive impact on multi-agent reasoning.
The baseline methods show a slight decrease (0.74 to 0.71) between the two runs, which may simply reflect run-to-run sampling variance. It also suggests that debate benefits more from structured prompting than single-step methods do.
---
The debate pipeline achieves lower accuracy compared to both baseline methods. This indicates that structured debate does not always improve reasoning performance.
Direct QA and Self Consistency achieve higher accuracy. Self Consistency benefits from sampling multiple independent answers, which reduces random errors.
In contrast, the debate system introduces additional reasoning steps that may propagate incorrect assumptions across multiple rounds.
---
During evaluation, several types of errors were observed:
- one-sided debates in which a debater returned no argument ("No argument generated")
- incorrect numerical or factual assumptions propagating across debate rounds
- judge outputs that violated the YES/NO format, producing "unknown" predictions

These errors contribute to the lower performance of the debate system.
---
Although the dataset is relatively small (100 questions), the gap between debate accuracy (0.48) and the baseline methods (0.74) is 26 percentage points, which is large enough that it is unlikely to be explained by random variation alone.
This suggests that the observed performance gap reflects systematic differences between the reasoning approaches rather than noise.
A larger dataset would allow more rigorous statistical validation, but the current results already indicate a clear trend.
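The report does not include a formal statistical test, but as a rough check, a two-proportion z-test on these accuracies (treating the two runs as independent samples of 100) can be sketched as follows:

```python
from math import sqrt, erf

def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> tuple[float, float]:
    """Two-proportion z-test; returns (z statistic, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Normal CDF via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z(0.74, 0.48, 100, 100)
# |z| ≈ 3.8, p < 0.001 for these values
```

Since all methods answer the same 100 questions, a paired test such as McNemar's would be more appropriate; this unpaired sketch only gives a first approximation.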
---
Overall, the experimental results show that:
- structured debate does not outperform the simpler Direct QA and Self Consistency baselines on StrategyQA
- improved prompts (v2) raise debate accuracy from 0.48 to 0.52, but a substantial gap to the baselines remains
- debate introduces new failure modes, including one-sided debates and the propagation of incorrect assumptions

These findings highlight important limitations of multi-agent debate systems.
To better understand the behavior of the debate system, we analyze several debate transcripts collected during experiments. These examples highlight both successful reasoning cases and common failure modes.
---
Question: Are kayaks used at the summit of Mount Everest?
In this example, Debater B consistently argues that kayaks cannot be used because there is no water at the summit and the environment is extremely cold. The reasoning is grounded in physical constraints and real-world knowledge.
Even though Debater A fails to produce an argument, the judge correctly identifies the stronger reasoning and returns the correct answer: NO.
This example shows that debate can succeed when at least one agent provides clear and correct reasoning.
---
Observation: Debater A produced no argument.
In several transcripts, one debater failed to generate any response and returned "No argument generated". This leads to a one-sided debate where only one agent provides reasoning.
As a result, the judge has no alternative perspective to compare, and the final decision depends entirely on a single argument. This reduces the effectiveness of the debate framework.
This failure mode highlights the importance of robust prompt design to ensure both agents actively participate in the debate.
---
Question: Can Aerosmith fit in a 2020 Mitsubishi Outlander?
In this example, both debaters focus on the cargo capacity and equipment size. However, they introduce approximate and potentially incorrect numerical values for volume and weight.
The debate becomes centered around these assumptions rather than verifying the actual facts. As a result, the judge selects the stronger argument, but the reasoning itself may still be flawed.
This demonstrates a key limitation of debate systems: incorrect assumptions can propagate through multiple rounds and influence the final decision.
---
In some cases, the judge output does not follow a strict YES/NO format, which leads to extraction failures and "unknown" predictions.
This type of error is not related to reasoning ability, but rather to output formatting. It highlights the importance of enforcing structured outputs in prompt design.
---
Irving et al. (2018) proposed debate as a method to improve AI alignment by allowing agents to challenge each other's reasoning. The key idea is that incorrect arguments can be exposed through adversarial interaction.
Our experimental results partially support this theory. In some cases, strong arguments successfully dominate weak ones and lead to correct answers.
However, our findings also reveal limitations:
- debate collapses when one agent fails to produce an argument, leaving the judge with a single perspective
- incorrect assumptions introduced early can propagate unchallenged through later rounds
- the judge can only compare the arguments presented, not verify the underlying facts
These observations suggest that while debate can improve reasoning in theory, its practical effectiveness depends heavily on prompt design and agent reliability.
Multi-Agent Judge Panel Analysis

We observe that using multiple judges improves decision stability. When all judges agree, the answer is usually correct.
Disagreement between judges often occurs in difficult or ambiguous questions. This suggests that disagreement can be used as a signal of question difficulty.
Overall, the jury mechanism improves reliability compared to a single judge.
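The jury aggregation described above can be sketched as follows; `jury_verdict` is an illustrative function, not the actual `run_jury.py` implementation.

```python
from collections import Counter

def jury_verdict(judge_answers: list[str]) -> tuple[str, bool]:
    """Aggregate several judges' YES/NO answers by majority vote.
    Returns (final_answer, unanimous); disagreement flags hard questions."""
    counts = Counter(judge_answers)
    final, _ = counts.most_common(1)[0]
    unanimous = len(counts) == 1
    return final, unanimous

jury_verdict(["YES", "YES", "NO"])  # -> ("YES", False)
```

The `unanimous` flag captures the observation that judge disagreement correlates with question difficulty, so it could be logged as a per-question difficulty signal.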
---
The qualitative analysis shows that multi-agent debate has both strengths and weaknesses. It can produce structured reasoning, but also introduces new failure modes such as one-sided debates and error propagation.
After prompt improvements (v2), we observed fewer cases of missing arguments and more consistent debate participation. This directly contributed to higher accuracy.
In this project, prompt design plays a critical role in controlling the behavior of the debate agents. We iteratively improved prompts based on observed failures in early experiments.
---
In the first version (v1), we used simple role-based prompts for Debater A, Debater B, and the Judge.
Debater A was instructed to argue "YES", and Debater B was instructed to argue "NO". Both debaters were asked to provide a short answer and 2–3 sentences of reasoning.
Debater A:
Answer: YES
Reasoning: ...

Debater B:
Answer: NO
Reasoning: ...
The judge was given the full debate transcript and asked to select the winner:
Winner: A or B
Explanation: ...

---
The initial prompts were simple, but several issues appeared during experiments:
- debaters sometimes returned no argument at all ("No argument generated")
- outputs deviated from the expected Answer/Reasoning format, causing parsing failures
- reasoning was often too short for the judge to compare arguments meaningfully
These problems directly affected performance, resulting in a low debate accuracy of 0.48, compared to 0.74 for baseline methods.
---
After improving the prompts, debate accuracy increased from 0.48 to 0.52.
This improvement confirms that better prompt design:
- reduces missing arguments and keeps both debaters engaged
- enforces output formats that can be parsed reliably
- encourages step-by-step reasoning that the judge can compare
Although the improvement is moderate, it demonstrates that prompt engineering is a key factor in multi-agent systems.
---
1. Role Framing
Each agent is assigned a fixed role (YES vs NO). This creates adversarial interaction and encourages diverse reasoning paths.
2. Chain-of-Thought Style Reasoning
We encouraged step-by-step reasoning instead of short answers. This improves clarity and helps the judge compare arguments.
3. Output Format Constraints
Strict output formats were used to ensure reliable parsing of model outputs. This reduces errors such as "unknown" predictions.
4. Robustness Against Failure
We modified prompts to reduce the chance of missing arguments and ensure both agents actively participate in the debate.
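Beyond prompt changes, an application-level fallback could also reduce missing arguments. The sketch below, built around the hypothetical wrapper `ask_with_retry`, retries the model call when the output is empty or a placeholder; this mechanism is an illustration, not something the project reports implementing.

```python
def ask_with_retry(call_model, prompt: str, max_retries: int = 2) -> str:
    """Retry when a debater returns an empty or placeholder argument.
    `call_model` is a hypothetical function wrapping the LLM API call."""
    for _ in range(max_retries + 1):
        output = call_model(prompt)
        if output.strip() and "no argument generated" not in output.lower():
            return output
    return "No argument generated"  # surfaced to the judge as-is

# usage with a stubbed model that fails once and then succeeds:
responses = iter(["", "Answer: NO\nReasoning: ..."])
result = ask_with_retry(lambda p: next(responses), "prompt")
```

Combining a retry like this with the v2 prompt changes would make one-sided debates an explicit, logged event rather than a silent failure.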
---
Our prompt engineering process followed an iterative loop:
1. Run experiments with the current prompts.
2. Inspect transcripts and logs for failure modes (missing arguments, format violations).
3. Revise the prompts to address the observed failures.
4. Re-run the evaluation and compare accuracy.
This process allowed us to systematically improve the debate system.
---
We expect the improved prompts to:
- further reduce one-sided debates caused by missing arguments
- lower the rate of "unknown" predictions caused by format violations
- narrow, though likely not close, the accuracy gap to the baseline methods
Prompt engineering is essential for multi-agent systems. Our results show that simple prompts are not sufficient, and careful design is required to achieve reliable performance.
This section contains the final prompt templates used in our experiments (v2).
Placeholders such as {question} and {transcript} are dynamically replaced during execution.
Debater A prompt:

You are Debater A.
Question:
{question}
Your task:
Argue that the answer is YES.
Rules:
- You MUST answer YES
- Provide clear reasoning
- Use step-by-step thinking if needed
- Be confident and persuasive
- Do NOT say "I don't know"
Respond in this format:
Answer:
YES
Reasoning:
- Step 1:
- Step 2:
- Step 3:
Debater B prompt:

You are Debater B.
Question:
{question}
Your task:
Argue that the answer is NO.
Rules:
- You MUST answer NO
- Provide clear reasoning
- Use step-by-step thinking if needed
- Be confident and persuasive
- Challenge possible YES arguments
Respond in this format:
Answer:
NO
Reasoning:
- Step 1:
- Step 2:
- Step 3:
Judge prompt:

Question:
{question}
Debate transcript:
{transcript}
Your task:
Decide which debater gave the stronger argument.
Rules:
- Focus on logic and evidence
- Ignore style or length
- Choose the more correct reasoning
Respond in this format:
Winner:
A or B
Final Answer:
YES or NO
Explanation:
Short explanation.