Multi-Agent Debate for StrategyQA

Evaluating whether multi-agent debate improves reasoning in large language models

Abstract

Large language models often struggle with complex reasoning tasks. Recent research suggests that structured debate between multiple agents may improve reasoning quality by exposing incorrect arguments. In this project we implement a multi-agent debate system consisting of two debaters and a judge. The system is evaluated on the StrategyQA dataset. We compare debate reasoning against two baseline methods: Direct QA and Self Consistency.

Our results show that although debate produces structured reasoning, it does not always outperform simpler baselines.

1. Methodology

System Architecture

The proposed system implements a multi-agent debate framework designed to improve reasoning quality of large language models. The system consists of three agents: Debater A, Debater B, and a Judge. Each debater produces an initial answer and then participates in a structured debate where agents challenge and respond to each other's reasoning. After the debate rounds, a judge agent analyzes the full transcript and determines the final answer.

The overall architecture follows a four-stage pipeline:

Project Structure

The system is implemented using a modular Python architecture. The repository is organized into several components responsible for agent logic, debate control, utilities, and experimental evaluation.


LLM_Debate_and_Judge_Pipeline

agents/
    debater.py
    judge.py

debate/
    debate_orchestrator.py

experiments/
    run_debate.py
    run_direct_qa.py
    run_self_consistency.py
    run_jury.py

utils/
    answer_utils.py
    config_loader.py

config/
    config.yaml

logs/
    experiment results

This modular design allows easy modification of agents, debate parameters, and evaluation pipelines.

Debate Protocol

Each debate begins with both debaters generating an initial YES or NO answer. The debate then proceeds through multiple rounds where agents respond to each other's arguments using the previous transcript as context.

To reduce unnecessary debate rounds, an adaptive stopping rule is used. If both agents produce the same answer for two consecutive rounds, the debate terminates early.

Model Choices

The system uses large language models accessed through the OpenAI API. These models were selected because they demonstrate strong natural language reasoning abilities and can generate structured arguments.

Using the same model for both debaters ensures fairness and allows the debate structure itself to influence reasoning outcomes.

Configuration and Hyperparameters

All experimental parameters are defined in a configuration file (config/config.yaml). This ensures reproducibility and allows easy adjustment of debate settings.

Key parameters include:

These parameters were selected to balance reasoning quality and computational efficiency.

2. Experiments

Experimental Setup

To evaluate the effectiveness of the debate framework, we conducted experiments on the StrategyQA dataset. StrategyQA is a benchmark dataset that requires multi-step reasoning and implicit knowledge to answer binary (YES/NO) questions.

We selected a subset of 100 questions for evaluation. Each question includes a ground-truth answer used to compute accuracy.

We compare three reasoning methods:

---

Implementation Details

All experiments are implemented in Python and executed using the OpenAI API. The system is modular and consists of separate scripts for each method:

experiments/run_debate.py
experiments/run_direct_qa.py
experiments/run_self_consistency.py

Results are stored as JSON files for reproducibility:

logs/debate_results.json
logs/direct_qa_results.json
logs/self_consistency_results.json
---

Evaluation Metric

We use accuracy as the main evaluation metric. Accuracy is computed as:

Accuracy = (Number of Correct Predictions) / (Total Number of Questions)

Predictions are extracted from model outputs and compared with the ground-truth labels from the dataset.

---

Debate Configuration

The debate system is configured using a YAML configuration file. Key parameters include:

Each debate produces a full transcript, which is later evaluated by the judge.

---

Baseline Configuration

For Direct QA, the model generates a single response for each question.

For Self Consistency, multiple responses are generated using temperature sampling, and the final answer is determined by majority vote.

---

Results

The following table shows the accuracy of each method:

Method Accuracy (v1) Accuracy (v2)
Debate 0.48 0.52
Direct QA 0.74 0.71
Self Consistency 0.74 0.71

The improved prompt design (v2) increases debate accuracy from 0.48 to 0.52. This shows that prompt engineering has a positive impact on multi-agent reasoning.

Interestingly, baseline methods show a slight decrease in accuracy. This suggests that debate benefits more from structured prompting compared to single-step methods.

---

Observations

The debate pipeline achieves lower accuracy compared to both baseline methods. This indicates that structured debate does not always improve reasoning performance.

Direct QA and Self Consistency achieve higher accuracy. Self Consistency benefits from sampling multiple independent answers, which reduces random errors.

In contrast, the debate system introduces additional reasoning steps that may propagate incorrect assumptions across multiple rounds.

---

Error Analysis (Quantitative)

During evaluation, several types of errors were observed:

These errors contribute to the lower performance of the debate system.

---

Statistical Considerations

Although the dataset size is relatively small (100 questions), the difference between debate accuracy (0.48) and baseline methods (0.74) is significant.

This suggests that the observed performance gap is not due to random variation, but reflects systematic differences between reasoning approaches.

A larger dataset could provide more reliable statistical validation, but the current results already indicate a clear trend.

---

Summary

Overall, the experimental results show that:

These findings highlight important limitations of multi-agent debate systems.

3. Analysis

To better understand the behavior of the debate system, we analyze several debate transcripts collected during experiments. These examples highlight both successful reasoning cases and common failure modes.

---

Case 1: Successful Debate

Question: Are kayaks used at the summit of Mount Everest?

In this example, Debater B consistently argues that kayaks cannot be used because there is no water at the summit and the environment is extremely cold. The reasoning is grounded in physical constraints and real-world knowledge.

Even though Debater A fails to produce an argument, the judge correctly identifies the stronger reasoning and returns the correct answer: NO.

This example shows that debate can succeed when at least one agent provides clear and correct reasoning.

---

Case 2: One-Sided Debate (Failure Mode)

Observation: Debater A produced no argument.

In several transcripts, one debater failed to generate any response and returned "No argument generated". This leads to a one-sided debate where only one agent provides reasoning.

As a result, the judge has no alternative perspective to compare, and the final decision depends entirely on a single argument. This reduces the effectiveness of the debate framework.

This failure mode highlights the importance of robust prompt design to ensure both agents actively participate in the debate.

---

Case 3: Incorrect Assumption Propagation

Question: Can Aerosmith fit in a 2020 Mitsubishi Outlander?

In this example, both debaters focus on the cargo capacity and equipment size. However, they introduce approximate and potentially incorrect numerical values for volume and weight.

The debate becomes centered around these assumptions rather than verifying the actual facts. As a result, the judge selects the stronger argument, but the reasoning itself may still be flawed.

This demonstrates a key limitation of debate systems: incorrect assumptions can propagate through multiple rounds and influence the final decision.

---

Case 4: Output Format Failure

In some cases, the judge output does not follow a strict YES/NO format, which leads to extraction failures and "unknown" predictions.

This type of error is not related to reasoning ability, but rather to output formatting. It highlights the importance of enforcing structured outputs in prompt design.

---

Connection to Debate Theory (Irving et al.)

Irving et al. (2018) proposed debate as a method to improve AI alignment by allowing agents to challenge each other's reasoning. The key idea is that incorrect arguments can be exposed through adversarial interaction.

Our experimental results partially support this theory. In some cases, strong arguments successfully dominate weak ones and lead to correct answers.

However, our findings also reveal limitations:

These observations suggest that while debate can improve reasoning in theory, its practical effectiveness depends heavily on prompt design and agent reliability.

h3>Multi-Agent Judge Panel Analysis

We observe that using multiple judges improves decision stability. When all judges agree, the answer is usually correct.

Disagreement between judges often occurs in difficult or ambiguous questions. This suggests that disagreement can be used as a signal of question difficulty.

Overall, the jury mechanism improves reliability compared to a single judge.

---

Summary

The qualitative analysis shows that multi-agent debate has both strengths and weaknesses. It can produce structured reasoning, but also introduces new failure modes such as one-sided debates and error propagation.

After prompt improvements (v2), we observed fewer cases of missing arguments and more consistent debate participation. This directly contributed to higher accuracy.

4. Prompt Engineering

In this project, prompt design plays a critical role in controlling the behavior of the debate agents. We iteratively improved prompts based on observed failures in early experiments.

---

Initial Design (v1)

In the first version (v1), we used simple role-based prompts for Debater A, Debater B, and the Judge.

Debater A was instructed to argue "YES", and Debater B was instructed to argue "NO". Both debaters were asked to provide a short answer and 2–3 sentences of reasoning.

Debater A:
Answer: YES
Reasoning: ...

Debater B:
Answer: NO
Reasoning: ...

The judge was given the full debate transcript and asked to select the winner:

Winner: A or B
Explanation: ...
---

Problems Observed in v1

The initial prompts were simple, but several issues appeared during experiments:

These problems directly affected performance, resulting in a low debate accuracy of 0.48, compared to 0.74 for baseline methods.

---

Results After Iteration (v2)

After improving the prompts, debate accuracy increased from 0.48 to 0.52.

This improvement confirms that better prompt design:

Although the improvement is moderate, it demonstrates that prompt engineering is a key factor in multi-agent systems.

---

Key Design Decisions

1. Role Framing

Each agent is assigned a fixed role (YES vs NO). This creates adversarial interaction and encourages diverse reasoning paths.

2. Chain-of-Thought Style Reasoning

We encouraged step-by-step reasoning instead of short answers. This improves clarity and helps the judge compare arguments.

3. Output Format Constraints

Strict output formats were used to ensure reliable parsing of model outputs. This reduces errors such as "unknown" predictions.

4. Robustness Against Failure

We modified prompts to reduce the chance of missing arguments and ensure both agents actively participate in the debate.

---

Iteration Strategy

Our prompt engineering process followed an iterative loop:

  1. Run experiments
  2. Analyze failures (logs and transcripts)
  3. Identify weak points in prompts
  4. Modify instructions and constraints
  5. Re-run experiments

This process allowed us to systematically improve the debate system.

---

Expected Impact of v2

We expect the improved prompts to:

---

Summary

Prompt engineering is essential for multi-agent systems. Our results show that simple prompts are not sufficient, and careful design is required to achieve reliable performance.

Appendix: Full Prompts

This section contains the final prompt templates used in our experiments (v2). Placeholders such as {question} and {transcript} are dynamically replaced during execution.

---
Debater A Prompt (YES)
You are Debater A.

Question:
{question}

Your task:
Argue that the answer is YES.

Rules:
- You MUST answer YES
- Provide clear reasoning
- Use step-by-step thinking if needed
- Be confident and persuasive
- Do NOT say "I don't know"

Respond in this format:

Answer:
YES

Reasoning:
- Step 1:
- Step 2:
- Step 3:
---
Debater B Prompt (NO)
You are Debater B.

Question:
{question}

Your task:
Argue that the answer is NO.

Rules:
- You MUST answer NO
- Provide clear reasoning
- Use step-by-step thinking if needed
- Be confident and persuasive
- Challenge possible YES arguments

Respond in this format:

Answer:
NO

Reasoning:
- Step 1:
- Step 2:
- Step 3:
---
Judge Prompt
Question:
{question}

Debate transcript:
{transcript}

Your task:
Decide which debater gave the stronger argument.

Rules:
- Focus on logic and evidence
- Ignore style or length
- Choose the more correct reasoning

Respond in this format:

Winner:
A or B

Final Answer:
YES or NO

Explanation:
Short explanation.
---

Notes