Introduction
A single language model, no matter how capable, can hallucinate, make reasoning errors, and exhibit hidden biases. The traditional solution in engineering and science has long been peer review: multiple experts independently evaluate the same work, critique each other's conclusions, and converge on a better answer.
LLM Councils apply this same principle to AI systems: multiple language models independently reason about the same task, critique each other’s outputs, and converge on a higher-quality final answer through structured aggregation.
The core insight:
Individual models are fallible. Structured disagreement makes them reliable.
Think of it as:
- Code review for AI outputs
- Peer review for reasoning
- Adversarial collaboration at scale
- Ensemble learning with explicit critique
Why this matters:
Traditional ensemble methods (like voting or averaging) treat models as black boxes—they combine outputs without understanding why models disagree. LLM Councils make disagreement explicit and productive:
| Single Model | Traditional Ensemble | LLM Council |
|---|---|---|
| One perspective | Multiple votes | Multiple perspectives + critique |
| Hidden errors | Majority vote masks errors | Errors surfaced through critique |
| No self-correction | No reasoning about disagreement | Explicit reasoning about differences |
| Fast but fallible | Fast (parallel), slightly better | Slower, significantly more reliable |
When LLM Councils shine:
- High-stakes decisions (security analysis, legal reasoning)
- Low tolerance for hallucinations (medical advice, compliance checks)
- Complex reasoning tasks (system design, architectural decisions)
- Verification requirements (code review, policy validation)
What you’ll learn:
- Why single models fail and how councils address these failures
- The canonical architecture and four-phase execution model
- Design patterns (homogeneous, heterogeneous, specialist councils)
- Production implementation with real code examples
- Cost/latency tradeoffs and when NOT to use councils
- Security considerations and failure modes
This guide takes you from understanding the fundamentals to deploying production-grade LLM Council systems.
1. Why LLM Councils Exist
The single-model failure modes
Single-model LLM systems suffer from fundamental limitations that become critical in production environments:
1. Confident hallucinations
Models can generate plausible-sounding but completely false information with high confidence:
# Example: Single model analyzing a security vulnerability
Query: "Is this authentication implementation secure?"
Single Model Response:
"Yes, this implementation is secure. It uses JWT tokens with HS256 signing,
stores tokens in localStorage, and validates on every request. This follows
industry best practices."
# Problems:
# - localStorage is vulnerable to XSS attacks (should use httpOnly cookies)
# - HS256 with shared secrets has key distribution issues (should use RS256)
# - No mention of token rotation or expiration
# - Confidently stated as "secure" and "best practices"
Why this happens:
- Models optimize for fluency, not correctness
- Training data contains both correct and incorrect information
- No built-in verification mechanism
- Confidence is a property of generation, not accuracy
2. Hidden reasoning errors
Models can make logical mistakes that aren’t obvious from the output:
# Example: Analyzing cascading system failures
Query: "If Service A fails, what services are affected?"
Single Model Response:
"Service B and Service C depend on Service A, so they will fail.
Service D is independent and will continue operating."
# Hidden error:
# - Service D actually depends on Service C (indirect dependency)
# - The model only checked direct dependencies
# - Service D will eventually fail, but model missed the transitive relationship
Why this happens:
- Models don’t explicitly show their reasoning chain
- Multi-hop reasoning is error-prone
- No way to validate intermediate steps
- Transitive relationships are harder to track
3. Prompt sensitivity
Small changes in phrasing can produce dramatically different answers:
# Example: Code review
Prompt A: "Review this code for security issues"
Response: Identifies 2 issues
Prompt B: "Is this code secure?"
Response: "Yes, looks good"
Prompt C: "What security vulnerabilities exist in this code?"
Response: Identifies 5 issues
# Same code, different framings → inconsistent analysis
Why this happens:
- Models are sensitive to prompt framing
- Different phrasings activate different training patterns
- No canonical representation of the task
- Context window effects vary with prompt structure
4. Overfitting to one reasoning style
Each model has training biases that shape how it approaches problems:
# Example: Different models, different styles
GPT-4: Tends toward detailed step-by-step reasoning
Claude: Tends toward structured analysis with clear sections
Gemini: Tends toward conceptual overviews with examples
# For the same task, you get different strengths:
# - GPT-4 might catch implementation details
# - Claude might identify architectural issues
# - Gemini might spot conceptual misunderstandings
# Using only one model means missing perspectives
How human organizations solve this
The pattern is universal across domains:
Academic research:
- Papers go through peer review
- Multiple reviewers independently evaluate
- Authors respond to critiques
- Editors synthesize feedback
Software engineering:
- Code goes through pull request review
- Multiple engineers review independently
- Discussion resolves disagreements
- Maintainer makes final decision
Legal systems:
- Multiple judges or jurors
- Adversarial presentation of arguments
- Deliberation to resolve differences
- Majority or consensus decision
Medical diagnosis:
- Second opinions for serious conditions
- Tumor boards with multiple specialists
- Differential diagnosis with multiple hypotheses
- Consensus on treatment plan
The LLM Council parallel
LLM Council = automated peer review for AI outputs
Instead of hoping one model gets it right, structure the process:
# Single model approach
answer = model.generate(query)
# Hope it's correct
# LLM Council approach
answers = [model_a.generate(query),
model_b.generate(query),
model_c.generate(query)]
critiques = [model.critique(answers) for model in models]
final = chairman.synthesize(answers, critiques)
# Confidence through structured disagreement
Key differences from naive ensembling:
| Naive Ensemble | LLM Council |
|---|---|
| Average outputs | Explicit critique |
| Vote for majority | Reason about disagreements |
| No error analysis | Surface and discuss errors |
| Black box combination | Transparent decision process |
The fundamental principle:
Disagreement is not a problem to eliminate—it’s a signal to investigate.
When multiple intelligent systems disagree, at least one of them is wrong. The council structure makes that disagreement productive by forcing explicit analysis of why they differ.
2. Core Idea (Mental Model)
The three-phase mental model
Think of an LLM Council as structured collaboration through three distinct phases:
Phase 1: Independent Reasoning (Isolation)
↓
Phase 2: Mutual Critique (Adversarial)
↓
Phase 3: Consensus Synthesis (Integration)
Critical constraints:
No shared context during generation
- Each model sees only the original query
- No access to other models’ responses
- Prevents groupthink and answer copying
- Forces truly independent reasoning
No copying answers
- Models can’t just agree with the first answer
- Must produce original analysis
- Critique phase identifies copied reasoning
- Preserves diversity of thought
No averaging tokens
- Not a voting system
- Not probability averaging
- Explicit reasoning about differences
- Synthesis, not statistical combination
Why independence matters
The parallel to human review:
Imagine a code review where:
- Reviewer 2 reads Reviewer 1’s comments before reviewing
- Reviewer 3 reads both previous reviews
What happens:
- Anchoring bias (first review dominates)
- Groupthink (later reviewers conform)
- Missing perspectives (everyone follows first frame)
- False consensus (agreement without diversity)
LLM Councils prevent this:
# Bad: Sequential review with shared context
review_1 = model_a.review(code)
review_2 = model_b.review(code, context=review_1) # ❌ Biased
review_3 = model_c.review(code, context=[review_1, review_2]) # ❌ Groupthink
# Good: Independent review, then comparison
review_1 = model_a.review(code) # ✓ Independent
review_2 = model_b.review(code) # ✓ Independent
review_3 = model_c.review(code) # ✓ Independent
# Now compare and critique
critiques = compare_reviews([review_1, review_2, review_3])
The critique phase is adversarial by design
Human parallel: Academic peer review
When you submit a paper for review:
- Reviewers read independently
- Each reviewer critiques from their expertise
- Reviewers disagree (this is healthy)
- Editor weighs critiques and decides
LLM Council parallel:
# Phase 1: Independent generation
answers = {
"model_a": "Solution X because of reasoning A",
"model_b": "Solution Y because of reasoning B",
"model_c": "Solution X because of reasoning C"
}
# Phase 2: Each model critiques others
critique_by_a = model_a.critique(answers["model_b"], answers["model_c"])
# "Model B's approach has flaw F. Model C's reasoning is sound but incomplete."
critique_by_b = model_b.critique(answers["model_a"], answers["model_c"])
# "Model A and C agree but both missed consideration G."
critique_by_c = model_c.critique(answers["model_a"], answers["model_b"])
# "Model B identified issue I that Model A overlooked. However..."
# Phase 3: Chairman synthesizes
final = chairman.synthesize(answers, all_critiques)
# "Combining Model A and C's approach with Model B's insight about issue I..."
Key insight: Disagreement surfaces errors that wouldn’t be caught by any single model.
What synthesis means (not averaging)
Bad synthesis (statistical):
# ❌ Token-level averaging
final_tokens = average([model_a_tokens, model_b_tokens, model_c_tokens])
# Result: Nonsensical mix of contradictory phrases
Good synthesis (reasoning):
# ✓ Explicit reasoning about disagreements
def synthesize(answers, critiques):
"""
Chairman model reasons:
1. Where do models agree? (High confidence)
2. Where do they disagree? (Investigate why)
3. Which critiques are valid? (Evaluate evidence)
4. What's the best final answer? (Integrate insights)
"""
agreements = find_consensus(answers)
disagreements = find_conflicts(answers)
validated_critiques = evaluate_critiques(critiques)
return construct_answer(
consensus_points=agreements,
resolved_conflicts=resolve(disagreements, validated_critiques),
novel_insights=extract_unique_contributions(answers, critiques)
)
The chairman is not a voting system:
# ❌ Simple voting
final = majority_vote([answer_a, answer_b, answer_c])
# Loses nuance, misses partial correctness
# ✓ Reasoned synthesis
final = chairman.reason_about(
answers=[answer_a, answer_b, answer_c],
critiques=[critique_a_of_others, critique_b_of_others, critique_c_of_others],
task="Produce the best answer by integrating valid points and resolving conflicts"
)
# Preserves nuance, combines insights
The mental model in practice
Think of the council as a meeting:
Independent preparation (Phase 1)
- Everyone reads the problem
- Everyone prepares their analysis
- No talking beforehand
Round-robin presentation (Phase 2)
- Each person presents their view
- Others listen and take notes
- Identify agreements and disagreements
Discussion (Phase 2 continued)
- Point out flaws in reasoning
- Highlight missed considerations
- Defend or revise positions
Decision (Phase 3)
- Chair synthesizes discussion
- Integrates best ideas
- Resolves remaining conflicts
- Delivers final recommendation
The difference: This meeting happens in parallel at LLM speeds, with perfect recall and structured output.
3. Canonical LLM Council Architecture
High-level architecture diagram
                        ┌──────────────┐
        User Query ───▶ │ Orchestrator │
                        └──────┬───────┘
                               │
             ┌─────────────────┼─────────────────┐
             │                 │                 │
        ┌────▼────┐       ┌────▼────┐       ┌────▼────┐
        │ Model A │       │ Model B │       │ Model C │    Phase 1: Independent Generation
        └────┬────┘       └────┬────┘       └────┬────┘    (Parallel execution)
             │                 │                 │
             └────────┬────────┴────────┬────────┘
                      │                 │
        ┌─────────────▼─────────────────▼─────────────┐
        │   Cross-Critique Matrix                     │    Phase 2: Mutual Critique
        │   A critiques [B, C]                        │    (Each model reviews others)
        │   B critiques [A, C]                        │
        │   C critiques [A, B]                        │
        └──────────────────────┬──────────────────────┘
                               │
                       ┌───────▼────────┐
                       │ Scoring Engine │    Phase 3: Evaluation
                       │   (Optional)   │    (Quantitative ranking)
                       └───────┬────────┘
                               │
                       ┌───────▼────────┐
                       │ Chairman Model │    Phase 4: Synthesis
                       │  (Integrator)  │    (Final decision)
                       └───────┬────────┘
                               │
                               ▼
                         Final Answer
Component breakdown
1. Orchestrator (Controller)
Manages the entire council process:
class CouncilOrchestrator:
def __init__(self, models: List[LLM], chairman: LLM):
self.models = models
self.chairman = chairman
async def process(self, query: str) -> CouncilResult:
# Phase 1: Independent generation
answers = await self.generate_phase(query)
# Phase 2: Cross-critique
critiques = await self.critique_phase(answers)
# Phase 3: Optional scoring
scores = await self.score_phase(answers, critiques)
# Phase 4: Chairman synthesis
final = await self.synthesis_phase(query, answers, critiques, scores)
return CouncilResult(
final_answer=final,
member_answers=answers,
critiques=critiques,
scores=scores,
metadata=self._build_metadata()
)
2. Council Members (Models A, B, C, …)
Each member has a clear contract:
class CouncilMember:
def __init__(self, model: LLM, role: str = "general"):
self.model = model
self.role = role # "reasoner", "verifier", "devil's advocate", etc.
async def generate(self, query: str) -> Answer:
"""
Phase 1: Generate independent answer
- No context from other models
- Complete reasoning chain
- Explicit confidence levels
"""
prompt = self._build_generation_prompt(query, self.role)
return await self.model.generate(prompt)
async def critique(self, answers: List[Answer]) -> Critique:
"""
Phase 2: Critique other answers
- Identify errors
- Evaluate reasoning
- Suggest improvements
"""
prompt = self._build_critique_prompt(answers, self.role)
return await self.model.generate(prompt)
3. Critique Matrix
Structured representation of all critiques:
class CritiqueMatrix:
"""
Stores N x (N-1) critiques where each model reviews all others
Example for 3 models:
matrix = {
"model_a": {
"critique_of_b": Critique(...),
"critique_of_c": Critique(...)
},
"model_b": {
"critique_of_a": Critique(...),
"critique_of_c": Critique(...)
},
"model_c": {
"critique_of_a": Critique(...),
"critique_of_b": Critique(...)
}
}
"""
def __init__(self, models: List[str]):
self.matrix = {
model: {other: None for other in models if other != model}
for model in models
}
def add_critique(self, critic: str, target: str, critique: Critique):
self.matrix[critic][target] = critique
def get_critiques_of(self, model: str) -> List[Critique]:
"""Get all critiques that target this model"""
return [
self.matrix[critic][model]
for critic in self.matrix
if model in self.matrix[critic]
]
def get_critiques_by(self, model: str) -> List[Critique]:
"""Get all critiques made by this model"""
return list(self.matrix[model].values())
4. Chairman Model (Synthesizer)
The final decision maker:
class Chairman:
def __init__(self, model: LLM):
self.model = model
async def synthesize(
self,
query: str,
answers: List[Answer],
critiques: CritiqueMatrix,
scores: Optional[Dict] = None
) -> FinalAnswer:
"""
Produces final answer by:
1. Analyzing all member answers
2. Weighing all critiques
3. Identifying consensus areas
4. Resolving conflicts
5. Integrating best insights
"""
prompt = self._build_synthesis_prompt(
query=query,
answers=self._anonymize_answers(answers), # Hide model identity
critiques=critiques,
scores=scores
)
return await self.model.generate(prompt)
def _anonymize_answers(self, answers: List[Answer]) -> List[Answer]:
"""
Remove model identities to prevent bias
Answers labeled as "Response A", "Response B", etc.
"""
return [
Answer(content=a.content, label=f"Response {chr(65+i)}")
for i, a in enumerate(answers)
]
5. Data structures
from dataclasses import dataclass
from typing import List, Dict, Optional
@dataclass
class Answer:
    content: str
    reasoning: Optional[str] = None
    confidence: Optional[float] = None
    model_id: Optional[str] = None
    label: Optional[str] = None  # Anonymized label ("Response A", ...) used during synthesis
@dataclass
class Critique:
target_answer: str # Which answer is being critiqued
identified_errors: List[str]
strengths: List[str]
suggestions: List[str]
severity: str # "critical", "major", "minor", "none"
@dataclass
class CouncilResult:
final_answer: str
member_answers: List[Answer]
critiques: CritiqueMatrix
scores: Optional[Dict]
metadata: Dict # Timing, costs, model versions, etc.
confidence: float
def explain(self) -> str:
"""Generate explanation of how decision was reached"""
return f"""
Council Decision Process:
Member Answers: {len(self.member_answers)}
Total Critiques: {len(self.critiques.all())}
Consensus Areas: {self._find_consensus()}
Resolved Conflicts: {self._find_conflicts()}
Final Confidence: {self.confidence:.2%}
"""
Key architectural principles
1. Isolation during generation
# Bad: Sequential generation with context leakage
answer_a = await model_a.generate(query)
answer_b = await model_b.generate(query, context=answer_a) # ❌
answer_c = await model_c.generate(query, context=[answer_a, answer_b]) # ❌
# Good: Parallel independent generation
answers = await asyncio.gather(
model_a.generate(query),
model_b.generate(query),
model_c.generate(query)
) # ✓ True independence
2. Structured critique phase
# Bad: Unstructured critique
critique = "Model B's answer is wrong" # No detail, unhelpful
# Good: Structured critique
critique = Critique(
target_answer="model_b",
identified_errors=[
"Line 15 assumes X but documentation shows Y",
"Missing consideration of edge case Z"
],
strengths=[
"Correct identification of main issue",
"Clear reasoning chain"
],
suggestions=[
"Verify assumption X against source material",
"Add handling for edge case Z"
],
severity="major"
) # ✓ Actionable, specific
3. Anonymous synthesis
# Bad: Chairman knows which model produced which answer
final = chairman.synthesize(
answers=[
("GPT-4", answer_a), # ❌ Model identity bias
("Claude", answer_b),
("Gemini", answer_c)
]
)
# Good: Blind synthesis
final = chairman.synthesize(
answers=[
("Response A", answer_a), # ✓ Anonymous
("Response B", answer_b),
("Response C", answer_c)
]
)
# Chairman evaluates content only, not brand
4. Explainability
Every council decision should be traceable:
result = await council.process(query)
# Access full decision trail
print(result.member_answers) # What each model said
print(result.critiques) # What each model criticized
print(result.scores) # How answers were ranked
print(result.explain()) # Synthesis reasoning
print(result.metadata) # Performance metrics
4. Phases Explained (Step-by-Step)
4.1 Generation Phase
Goal: Each model independently produces a complete answer
Process:
async def generate_phase(query: str, models: List[LLM]) -> List[Answer]:
"""
Phase 1: Independent parallel generation
Critical requirements:
- True isolation (no shared context)
- Parallel execution (no sequential dependency)
- Complete answers (not fragments)
"""
prompts = [
f"""You are participating in a council review. Provide your independent
analysis of the following query.
Query: {query}
Requirements:
- Provide complete reasoning
- State your confidence level
- Identify any assumptions
- Show your work
Answer:"""
for _ in models
]
# Execute in parallel for true independence
answers = await asyncio.gather(*[
model.generate(prompt)
for model, prompt in zip(models, prompts)
])
return answers
Example output (Security Review Task):
Query: "Review this authentication implementation for security issues"
Model A (GPT-4) Output:
"This implementation has 3 critical issues:
1. Tokens stored in localStorage (XSS vulnerable)
2. No CSRF protection on auth endpoints
3. Weak password hashing (bcrypt rounds too low)
Confidence: High (90%)
Assumptions: Standard web threat model, browser environment"
Model B (Claude) Output:
"Security analysis:
- Authentication: JWT-based, reasonable
- Storage: localStorage - CRITICAL VULNERABILITY
- Token validation: Present but incomplete
- Session management: Missing rotation
Issues prioritized by severity: [...]
Confidence: Medium (75%)
Assumptions: Modern browser, no native app"
Model C (Gemini) Output:
"Authentication review shows:
✓ JWT implementation correct
✗ Storage mechanism insecure
✗ Missing rate limiting
? Unclear token expiration policy
Recommendation: Implement httpOnly cookies with SameSite
Confidence: Medium-High (80%)"
Critical rules for this phase:
- No tool use: Models reason from knowledge only
- No memory: Each generation is independent
- No cross-talk: Models don’t see each other’s outputs
- Complete answers: Not drafts or partial responses
Why isolation matters:
# Bad: Context leakage
answer_a = await model_a.generate(query)
answer_b = await model_b.generate(query, previous=answer_a) # ❌ Contaminated
# Good: True independence
answers = await asyncio.gather(
model_a.generate(query),
model_b.generate(query),
model_c.generate(query)
) # ✓ Independent perspectives
4.2 Critique Phase
Goal: Each model evaluates others’ answers adversarially
Process:
async def critique_phase(
answers: List[Answer],
models: List[LLM]
) -> CritiqueMatrix:
"""
Phase 2: Cross-critique matrix
Each model reviews all others:
- Model A critiques [B, C]
- Model B critiques [A, C]
- Model C critiques [A, B]
Resulting in N x (N-1) critiques
"""
matrix = CritiqueMatrix(model_ids=[m.id for m in models])
tasks = []
for i, critic_model in enumerate(models):
# Get all answers EXCEPT this model's own
others_answers = [a for j, a in enumerate(answers) if j != i]
critique_prompt = f"""You are participating in a peer review council.
Review the following answers to the same query and provide structured critique.
Your answer was: {answers[i].content}
Other responses:
{format_answers(others_answers)}
For each response, provide:
1. Identified errors (be specific)
2. Strengths (what they got right)
3. Suggestions for improvement
4. Severity assessment (critical/major/minor/none)
Be adversarial but fair. Your goal is to catch errors, not just agree."""
task = critic_model.generate(critique_prompt)
tasks.append((i, task))
# Execute all critiques in parallel
results = await asyncio.gather(*[t[1] for t in tasks])
# Populate matrix
for (model_idx, _), critique in zip(tasks, results):
matrix.add_critique_by_model(model_idx, critique)
return matrix
Example critique output:
Model A critiquing Model B:
{
"target": "Model B",
"errors": [
"Claims token validation is 'incomplete' but doesn't specify what's missing",
"Overlooked the bcrypt rounds issue that I identified"
],
"strengths": [
"Correctly identified localStorage vulnerability",
"Good prioritization by severity"
],
"suggestions": [
"Be more specific about validation gaps",
"Check password hashing configuration"
],
"severity": "minor" # Errors are in critique quality, not answer correctness
}
Model A critiquing Model C:
{
"target": "Model C",
"errors": [
"Uses checkmarks/crosses which isn't structured output",
"Doesn't mention CSRF vulnerability"
],
"strengths": [
"Concrete recommendation (httpOnly cookies with SameSite)",
"Clear verdict on JWT implementation"
],
"suggestions": [
"Add CSRF to vulnerability list",
"Specify httpOnly + SameSite + Secure flags"
],
"severity": "major" # Missing important security consideration
}
Why this step is adversarial:
- Models are incentivized to find flaws
- No social pressure to “be nice”
- Disagreements are explicit and productive
- Errors that survived individual reasoning get caught
Common critique patterns:
- Factual corrections: “Answer B claims X, but actually Y”
- Logical flaws: “Answer C’s reasoning from A to B doesn’t follow”
- Missed considerations: “None of the answers mention Z”
- Overgeneralizations: “Answer A assumes X always, but only sometimes”
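One practical way to keep critiques structured is to ask each critic for JSON and validate it against the Critique dataclass before it enters the matrix. A minimal sketch, assuming the critique prompt requests exactly the fields defined in Section 3; the severity whitelist and the rejection rules are illustrative choices, not a fixed standard:
import json

VALID_SEVERITIES = {"critical", "major", "minor", "none"}

def parse_critique(raw: str) -> Critique:
    """Parse a critic model's JSON output into the Critique dataclass, rejecting vague critiques."""
    data = json.loads(raw)  # Raises json.JSONDecodeError if the model did not return JSON
    severity = data.get("severity", "").lower()
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"Unknown severity: {severity!r}")
    errors = [e for e in data.get("identified_errors", []) if e.strip()]
    if severity in {"critical", "major"} and not errors:
        # A critical/major rating with no concrete errors is not actionable
        raise ValueError("Severity claims issues but no specific errors were listed")
    return Critique(
        target_answer=data["target_answer"],
        identified_errors=errors,
        strengths=data.get("strengths", []),
        suggestions=data.get("suggestions", []),
        severity=severity,
    )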
4.3 Scoring / Ranking Phase (Optional)
Goal: Quantify answer quality for synthesis prioritization
Process:
async def score_phase(
answers: List[Answer],
critiques: CritiqueMatrix
) -> Dict[str, Score]:
"""
Phase 3: Scoring (optional but recommended)
Combines:
- Self-reported confidence
- Critique severity
- Consensus detection
"""
scores = {}
for i, answer in enumerate(answers):
        # Get all critiques targeting this answer
        targeting_critiques = critiques.get_critiques_of(f"model_{i}")
        # Answers from the other council members (used for consensus counting below)
        other_answers = [a for j, a in enumerate(answers) if j != i]
score = Score(
# Self-assessment
self_confidence=answer.confidence,
# External assessment
critical_issues=count_severity(targeting_critiques, "critical"),
major_issues=count_severity(targeting_critiques, "major"),
minor_issues=count_severity(targeting_critiques, "minor"),
# Consensus
agreements=count_agreements(answer, other_answers),
# Derived score
composite=calculate_composite_score(
answer.confidence,
targeting_critiques,
agreements
)
)
scores[f"model_{i}"] = score
return scores
Scoring approaches:
1. Numeric scoring (0-100)
def calculate_numeric_score(answer: Answer, critiques: List[Critique]) -> float:
base_score = answer.confidence * 100 # Start with self-confidence
# Deduct for identified issues
for critique in critiques:
if critique.severity == "critical":
base_score -= 30
elif critique.severity == "major":
base_score -= 15
elif critique.severity == "minor":
base_score -= 5
# Bonus for consensus
consensus_bonus = count_agreements(answer) * 10
    # Severity deductions were already applied to base_score above
    return max(0, min(100, base_score + consensus_bonus))
2. Ordinal ranking (1st, 2nd, 3rd)
def rank_answers(scores: Dict[str, Score]) -> Dict[str, int]:
sorted_models = sorted(scores.items(), key=lambda x: x[1].composite, reverse=True)
return {model: rank+1 for rank, (model, score) in enumerate(sorted_models)}
3. Binary (acceptable / unacceptable)
def binary_score(answer: Answer, critiques: List[Critique]) -> str:
has_critical = any(c.severity == "critical" for c in critiques)
low_confidence = answer.confidence < 0.5
return "unacceptable" if (has_critical or low_confidence) else "acceptable"
When to skip scoring:
- Council of 2 models (chairman can compare directly)
- Qualitative synthesis preferred
- All answers are high quality (scoring adds little value)
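These rules can be encoded in a small helper so the orchestrator only pays for the scoring phase when it adds information. A sketch using the Answer type from Section 3; the member-count and confidence thresholds are assumptions to tune:
from typing import List

def should_run_scoring(answers: List[Answer],
                       min_members: int = 3,
                       high_confidence: float = 0.85) -> bool:
    """Decide whether the optional scoring phase is worth its cost."""
    # With only two members, the chairman can compare the answers directly
    if len(answers) < min_members:
        return False
    # If every member already reports high confidence, scoring adds little value
    confidences = [a.confidence for a in answers if a.confidence is not None]
    if confidences and min(confidences) >= high_confidence:
        return False
    return True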
4.4 Chairman Synthesis Phase
Goal: Produce final answer by reasoning about disagreements
Process:
async def synthesis_phase(
query: str,
answers: List[Answer],
critiques: CritiqueMatrix,
scores: Optional[Dict[str, Score]]
) -> FinalAnswer:
"""
Phase 4: Chairman synthesis
The chairman:
1. Sees all answers (anonymized)
2. Sees all critiques
3. Sees scores (if available)
4. Reasons about disagreements
5. Integrates best insights
6. Produces final answer with citations
"""
# Anonymize to prevent brand bias
anonymous_answers = anonymize(answers)
synthesis_prompt = f"""You are the chairman of a council that reviewed this query:
Query: {query}
Council Members provided these independent answers:
{format_anonymous_answers(anonymous_answers)}
They then critiqued each other:
{format_critiques(critiques)}
{f"Scores: {scores}" if scores else ""}
Your task:
1. Identify areas of consensus (high confidence)
2. Analyze disagreements (understand why)
3. Evaluate critiques (which are valid?)
4. Resolve conflicts (what's actually correct?)
5. Integrate insights (combine best elements)
6. Produce final answer with:
- Clear verdict
- Supporting evidence
- Confidence level
- Cite which responses contributed what
The final answer should be better than any individual response."""
return await chairman_model.generate(synthesis_prompt)
Example synthesis output:
Chairman Synthesis:
Final Security Assessment:
CRITICAL VULNERABILITIES:
1. localStorage storage (Responses A, B, C all identified)
→ Recommendation: Implement httpOnly cookies with Secure and SameSite flags
→ Confidence: Very High (unanimous agreement)
2. Missing CSRF protection (Response A identified, validated by others)
→ Recommendation: Add CSRF tokens to all state-changing endpoints
→ Confidence: High (Response A's critique of C was correct)
3. Weak password hashing (Response A identified, Response B overlooked)
→ Recommendation: Increase bcrypt rounds to 12+
→ Confidence: High (Response A's analysis validated)
ADDITIONAL CONCERNS:
4. Missing rate limiting (Response C identified)
→ Recommendation: Implement rate limiting on auth endpoints
→ Confidence: Medium (only one response noted this)
RESOLUTION OF DISAGREEMENTS:
- Response C's use of checkmarks was criticized but content was valid
- Response B's claim of "incomplete validation" was too vague (Response A's critique valid)
- All responses agreed on localStorage issue (highest confidence)
FINAL RECOMMENDATION:
Implement all 4 fixes in priority order. The localStorage vulnerability is the most critical
and should be addressed immediately.
Overall Confidence: High (85%)
Based on: Unanimous agreement on primary issue + validated secondary issues
Critical principle: Chairman reasons, doesn’t just pick the best score
# Bad: Mechanical selection
final = answers[highest_score_index] # ❌ Loses insights from others
# Good: Integrative reasoning
final = chairman.synthesize(
consensus_points=find_agreements(answers),
resolved_conflicts=resolve(disagreements, critiques),
novel_insights=extract_unique_contributions(answers)
) # ✓ Combines best of all responses
What makes a good synthesis:
- Cites sources: “Response A identified X, Response B added insight Y”
- Explains confidence: “High confidence due to unanimous agreement”
- Resolves conflicts: “A and B disagree on X, but C’s critique shows B is correct”
- Integrates insights: “Combining A’s analysis with C’s recommendation”
- Acknowledges uncertainty: “Only one response noted Z, medium confidence”
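These properties can also be spot-checked automatically before a synthesis is accepted. The sketch below uses simple string heuristics on the anonymized "Response A/B/C" labels; the patterns are illustrative, not a validated rubric:
import re
from typing import List

def check_synthesis_quality(synthesis: str, num_members: int) -> List[str]:
    """Return warnings for a chairman synthesis that skips citations or confidence."""
    warnings = []
    cited = set(re.findall(r"Response ([A-Z])", synthesis))
    if not cited:
        warnings.append("Synthesis cites no member responses")
    elif len(cited) == 1 and num_members > 1:
        warnings.append("Synthesis cites only one response; it may not be integrative")
    if not re.search(r"confidence", synthesis, re.IGNORECASE):
        warnings.append("Synthesis does not state a confidence level")
    return warnings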
5. Why This Works
The three core mechanisms
LLM Councils work because they exploit fundamental properties of language models and reasoning:
1. Diversity of reasoning paths
Different models (or same model with different temperatures) explore different solution spaces:
# Example: Math problem with multiple valid approaches
Query: "A train travels 120km in 2 hours, then 180km in 3 hours. What's the average speed?"
Model A (Algebraic):
total_distance = 120 + 180 # 300km
total_time = 2 + 3 # 5 hours
average_speed = 300 / 5 # 60 km/h
Model B (Weighted Average):
speed_1 = 120 / 2 # 60 km/h
speed_2 = 180 / 3 # 60 km/h
average = (speed_1 * 2 + speed_2 * 3) / 5 # 60 km/h
Model C (Incremental):
After 2h: 120km covered
After 5h: 120 + 180 = 300km covered
Average: 300 / 5 = 60 km/h
# All paths converge to the same answer (60 km/h)
# → High confidence
# If one model got 65 km/h, critiques would catch the error
Why diversity matters:
- Different approaches validate each other
- Errors in one path are caught by others
- Correct answer emerges from consensus
- Novel insights come from unique perspectives
2. Error detection via disagreement
When models disagree, at least one is wrong—this signals investigation is needed:
# Example: Disagreement reveals hidden error
Query: "Is this code thread-safe?"
Model A: "Yes, uses mutex locking"
Model B: "No, race condition on line 42"
Model C: "Yes, properly synchronized"
# Disagreement detected: B vs [A, C]
# Critique phase:
Model A critiquing B: "Line 42 is inside the lock, no race condition"
Model B critiquing A: "Lock is released before line 42 executes"
Model C critiquing B: "You're right, I missed the early return"
# Chairman synthesis:
"Model B correctly identified the race condition. The lock is released
on line 38 due to early return, leaving line 42 unprotected.
Confidence: High (error validated during critique)"
Without councils:
- Model A and C’s incorrect “Yes” would dominate
- Simple voting: 2-1 = wrong answer
- No mechanism to surface B’s correct critique
With councils:
- B’s disagreement forces investigation
- Critique phase validates B’s reasoning
- Chairman integrates the correction
- Final answer: Correct, with explanation
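An orchestrator can detect this kind of split automatically and escalate it into a deeper critique round rather than a vote. A minimal sketch, assuming short verdict-style answers that can be compared after normalization:
from collections import Counter
from typing import Dict, List

def detect_disagreement(answers: List[str]) -> Dict:
    """Group normalized answers and report whether the council is split."""
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    majority_answer, _ = counts.most_common(1)[0]
    return {
        "unanimous": len(counts) == 1,
        "majority_answer": majority_answer,
        "dissenting_indices": [i for i, a in enumerate(normalized) if a != majority_answer],
    }

# A dissent triggers an extra critique round instead of being outvoted
split = detect_disagreement(["yes, thread-safe", "no, race condition on line 42", "yes, thread-safe"])
if not split["unanimous"]:
    # Escalate: ask each dissenter to defend its position with specific evidence
    ...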
3. Self-critique under structured constraints
Models can critique better than they can generate because critique is easier than creation:
# Phenomenon: Models catch errors in others they wouldn't catch in themselves
Model A generates:
"The Eiffel Tower was built in 1889 in London" # Error
Model B generates:
"The Eiffel Tower was completed in 1889 in Paris" # Correct
Model A critiquing Model B:
"Correct. Built for 1889 World's Fair in Paris." # ✓ Would validate
Model B critiquing Model A:
"ERROR: Eiffel Tower is in Paris, not London." # ✓ Catches mistake
# Model A wouldn't have caught its own error
# But can catch others' errors when prompted to critique
Why critique is easier than generation:
- Verification is easier than creation
- Comparison reveals inconsistencies
- Adversarial framing activates different reasoning
- External perspective reduces blind spots
Why hallucinations collapse under scrutiny
Single model hallucination:
Model: "The Python `asyncio.gather` function was introduced in Python 2.7"
User: Accepts as fact (no verification mechanism)
Council catches hallucination:
Model A: "asyncio.gather was introduced in Python 2.7"
Model B critiques: "Error: asyncio was introduced in Python 3.4,
not Python 2.7. Python 2.7 had no asyncio module."
Model C critiques: "Confirmed, asyncio is Python 3.4+.
                    Python 2.7 relied on third-party libraries such as Twisted for async."
Chairman: "Model A hallucinated. Correct answer: asyncio.gather was
           introduced in Python 3.4 (released March 2014).
           Confidence: Very High (multiple models confirmed)"
Why this works:
- Hallucinations are usually not consistent across models
- Multiple independent checks catch fabrications
- Critique phase forces evidence-based reasoning
- Chairman requires validation before accepting claims
Empirical evidence
Reported findings (figures vary by task, benchmark, and council configuration):
- Error reduction: Councils reduce factual errors by 40-60% vs single models
- Confidence calibration: Council confidence scores correlate better with correctness
- Novel insight generation: 15-20% of final answers contain insights no single model had
- Hallucination detection: 70-80% of single-model hallucinations caught in critique
Example from production:
Task: Code security review (1000 test cases)
Single Model (GPT-4):
- False negatives: 23 (missed real vulnerabilities)
- False positives: 31 (flagged non-issues)
- Accuracy: 94.6%
Three-Model Council (GPT-4, Claude, Gemini):
- False negatives: 8 (missed real vulnerabilities)
- False positives: 12 (flagged non-issues)
- Accuracy: 98.0%
Cost: 7x higher
Latency: 3.5x higher
Value: Misses 65% fewer critical vulnerabilities (23 → 8 false negatives)
When councils fail
Councils don’t solve everything:
1. Systematic bias: All models share training data biases 2. Unfamiliar domains: If no model has expertise, council can’t help 3. Ambiguous truth: Subjective questions have no “right” answer to converge on 4. Coordinated hallucination: Rarely, multiple models hallucinate the same thing
Example failure:
Query: "What color should this button be?" # Subjective
Model A: "Blue (professional)"
Model B: "Green (action-oriented)"
Model C: "Red (attention-grabbing)"
Chairman: ??? # No objective answer
# Council can't resolve subjective design preferences
The takeaway: Councils excel at objective reasoning where truth exists and can be validated through critique.
6. Council Design Patterns
6.1 Homogeneous Council
Pattern: Same model, different sampling parameters
council = LLMCouncil(
models=[
GPT4(temperature=0.3), # Conservative
GPT4(temperature=0.7), # Balanced
GPT4(temperature=1.0), # Creative
],
chairman=GPT4(temperature=0.5)
)
Advantages:
- Cost-effective (single API, potential batch discounts)
- Fast (same model family, similar latency)
- Catches stochastic errors (temperature-induced variations)
- Consistent output format
Limitations:
- Shared training biases
- Similar reasoning patterns
- Less diversity in perspectives
- Can miss systematic model errors
Use when:
- Budget constrained (3x cost instead of 7x with different models)
- Latency sensitive (parallel calls to same provider)
- Known model is strong (GPT-4, Claude Opus)
- Error type is stochastic (random hallucinations)
Example:
# Good for: Fact-checking where stochastic variation catches errors
Query: "What year was the Eiffel Tower built?"
Model @ temp=0.3: "1889"
Model @ temp=0.7: "1889"
Model @ temp=1.0: "1889"
# High confidence due to consensus across temperatures
6.2 Heterogeneous Council (Best Practice)
Pattern: Different models from different providers
council = LLMCouncil(
models=[
GPT4(provider="OpenAI"),
Claude3_5Sonnet(provider="Anthropic"),
GeminiPro(provider="Google")
],
chairman=GPT4() # Or Claude, or rotating
)
Advantages:
- Maximum diversity (different architectures, training data)
- Different reasoning strengths (each model has unique capabilities)
- Catches systematic errors (one model’s bias revealed by others)
- Provider redundancy (if one API is down, others continue)
Trade-offs:
- Higher cost (multiple API providers)
- Increased latency (different providers, varying speeds)
- Output format inconsistency (each model has different style)
- Complex orchestration (multiple API clients)
Use when:
- Accuracy matters (medical, legal, security)
- Safety is critical (high-stakes decisions)
- Budget allows (7-10x single model cost)
- Ground truth verification needed
Example:
# Good for: Complex security analysis
Query: "Review this authentication implementation for vulnerabilities"
GPT-4: Identifies 3 issues (strong on implementation details)
Claude: Identifies 5 issues (strong on architectural patterns)
Gemini: Identifies 4 issues (strong on standards compliance)
# Chairman integrates all findings → 6 unique issues found
# Better than any single model
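Provider redundancy, listed above as an advantage, still has to be handled explicitly: if one API errors out, the council should proceed with the remaining answers rather than fail. A sketch of that degradation logic; the two-answer minimum is an assumption, not a fixed rule:
import asyncio
from typing import List

async def generate_with_redundancy(models: List, query: str, min_answers: int = 2) -> List:
    """Query all providers in parallel and drop any that fail instead of failing the council."""
    results = await asyncio.gather(
        *[m.generate(query) for m in models],
        return_exceptions=True,  # A provider outage becomes a value, not a crash
    )
    answers = [r for r in results if not isinstance(r, Exception)]
    if len(answers) < min_answers:
        raise RuntimeError(f"Only {len(answers)} provider(s) responded; not enough for a council")
    return answers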
6.3 Specialist Council
Pattern: Each model has an assigned role
council = SpecialistCouncil(
members={
"reasoner": Claude3_5Sonnet(
system_prompt="You are a logical reasoner. Focus on step-by-step analysis."
),
"verifier": GPT4(
system_prompt="You are a fact verifier. Challenge claims with evidence."
),
"devil's_advocate": Gemini(
system_prompt="You are a critic. Find flaws and edge cases."
),
"domain_expert": SpecializedModel(
system_prompt="You are a security expert. Focus on vulnerabilities."
)
},
chairman=GPT4(
system_prompt="You are the chairman. Synthesize all perspectives."
)
)
Role examples:
1. Reasoner (Builds logical chains)
Task: Analyze system design
Output: "If component A fails, then B loses input, causing C to timeout..."
2. Verifier (Checks facts and claims)
Task: Validate claims
Output: "Claim X is false. According to documentation Y, the actual behavior is Z."
3. Devil’s Advocate (Finds flaws)
Task: Challenge reasoning
Output: "This assumes normal operation. What about: power failure, network partition, corrupted state?"
4. Domain Expert (Provides specialized knowledge)
Task: Security review
Output: "This violates OWASP A02:2021 (Cryptographic Failures). Specifically..."
Advantages:
- Comprehensive coverage (each role brings unique perspective)
- Structured deliberation (clear responsibilities)
- Complementary strengths (roles designed to cover blind spots)
- Explainable process (can trace which role found what)
Implementation:
class SpecialistCouncil:
def __init__(self, members: Dict[str, LLM], chairman: LLM):
self.members = members
self.chairman = chairman
async def process(self, query: str) -> CouncilResult:
# Phase 1: Each specialist provides perspective
analyses = {}
for role, model in self.members.items():
prompt = self._build_role_prompt(query, role)
analyses[role] = await model.generate(prompt)
# Phase 2: Specialists critique each other
critiques = {}
for role, model in self.members.items():
# Each specialist reviews others' analyses from their role perspective
other_analyses = {r: a for r, a in analyses.items() if r != role}
critique_prompt = self._build_critique_prompt(role, other_analyses)
critiques[role] = await model.generate(critique_prompt)
# Phase 3: Chairman synthesizes
final = await self.chairman.synthesize(
query=query,
specialist_analyses=analyses,
specialist_critiques=critiques
)
return CouncilResult(
final_answer=final,
specialist_contributions=analyses,
critiques=critiques,
roles=list(self.members.keys())
)
Use when:
- Task has clear role divisions (e.g., “analyze risk” + “verify claims” + “find edge cases”)
- Need comprehensive coverage (want to systematically explore all angles)
- Explainability matters (can show “security expert found X, verifier confirmed Y”)
6.4 Hybrid Patterns
Pattern: Heterogeneous Specialists
Combine patterns 6.2 and 6.3:
council = LLMCouncil(
members={
"reasoner": GPT4(), # Best at logical chains
"verifier": Claude3_5(), # Best at evidence evaluation
"critic": Gemini(), # Best at finding gaps
},
chairman=GPT4()
)
Each specialist is the best model for that role.
Pattern: Homogeneous Specialists
Same model, different role prompts:
council = LLMCouncil(
members={
"reasoner": GPT4(system_prompt="Focus on logic"),
"verifier": GPT4(system_prompt="Focus on facts"),
"critic": GPT4(system_prompt="Focus on flaws"),
},
chairman=GPT4()
)
Cheaper but less diverse.
Pattern selection guide
| Requirement | Pattern | Models | Cost | Quality |
|---|---|---|---|---|
| Budget-sensitive | Homogeneous | 3x same | 3-4x | Good |
| High accuracy | Heterogeneous | 3+ different | 7-10x | Excellent |
| Structured roles | Specialist | Role-based | 5-8x | Excellent |
| Maximum quality | Heterogeneous Specialists | 3+ different roles | 10-15x | Outstanding |
Production recommendation: Start with heterogeneous (different models), add specialists if role clarity helps.
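The selection guide can also be written down as a small helper so the pattern choice is explicit in code rather than tribal knowledge. A sketch with thresholds taken roughly from the table above; the stakes labels and cost multipliers are assumptions to adapt:
def choose_council_pattern(stakes: str, budget_multiplier: float, needs_roles: bool) -> str:
    """Map requirements onto a council design pattern (rough heuristics from the table above)."""
    if stakes in {"critical", "high"}:
        if needs_roles and budget_multiplier >= 10:
            return "heterogeneous_specialists"  # Maximum quality, ~10-15x cost
        if budget_multiplier >= 7:
            return "heterogeneous"              # High accuracy, ~7-10x cost
    if needs_roles and budget_multiplier >= 5:
        return "specialist"                     # Structured roles, ~5-8x cost
    if budget_multiplier >= 3:
        return "homogeneous"                    # Budget-sensitive, ~3-4x cost
    return "single_model"                       # A council is not justified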
7. Production Use Cases
When to use LLM Councils
LLM Councils are justified when:
1. High-stakes decisions
Decisions where errors have significant consequences:
# Example: Financial compliance review
Query: "Does this transaction violate anti-money laundering regulations?"
Single model error: False negative → Regulatory fine + legal liability
Council benefit: 3x error reduction → Risk mitigation worth the cost
Use cases:
- Contract review (legal obligations)
- Compliance checking (regulatory requirements)
- Risk assessment (financial decisions)
- System design (architectural decisions)
2. Low tolerance for hallucinations
Domains where fabricated information is dangerous:
# Example: Medical literature synthesis
Query: "Summarize drug interactions for medication X"
Single model risk: Hallucinates interaction → Patient harm
Council benefit: Cross-validation catches fabrications
Use cases:
- Medical advice systems
- Safety-critical documentation
- Fact-checking platforms
- Educational content
3. Verifiable correctness required
Tasks where answers can be validated against ground truth:
# Example: Code security review
Query: "Identify vulnerabilities in this authentication system"
Single model: May miss 30% of vulnerabilities
Council: Reduces false negatives by 65%
Validation: Findings can be tested and verified
Use cases:
- Code review
- Security analysis
- Architectural validation
- Protocol verification
Detailed production examples
Example 1: Code Review
task = "Review this PR for security, correctness, and maintainability"
council = LLMCouncil(
models=[
GPT4(focus="security"),
Claude3_5(focus="architecture"),
Gemini(focus="maintainability")
]
)
result = await council.review(pr_diff)
# Output:
# - 8 issues found (vs 3 from single model)
# - Categorized by severity
# - Confidence scores per issue
# - Actionable recommendations
Value: Catches critical bugs before production deployment.
Example 2: Legal Document Analysis
task = "Identify legal risks in this contract"
council = SpecialistCouncil(
members={
"contract_analyst": GPT4(),
"risk_assessor": Claude3_5(),
"compliance_checker": Gemini()
}
)
result = await council.analyze(contract)
# Output:
# - Ambiguous clauses identified
# - Potential liabilities flagged
# - Compliance gaps noted
# - Risk mitigation suggestions
Value: Prevents costly legal disputes.
Example 3: System Design Validation
task = "Review this microservices architecture for scalability issues"
council = LLMCouncil(
models=[GPT4(), Claude3_5(), Gemini()]
)
result = await council.validate(architecture_doc)
# Output:
# - Single points of failure identified
# - Scalability bottlenecks found
# - Data consistency issues flagged
# - Alternative approaches suggested
Value: Avoids expensive refactoring later.
8. When NOT to Use an LLM Council
Avoid councils when
1. Latency must be <1 second
Councils are inherently slower due to multiple phases:
# Typical council latency breakdown (phases run in parallel, limited by the slowest model):
Phase 1 (Generation): ~2-6s
Phase 2 (Critique):   ~3-9s  (longer prompts: each critic reads the others' answers)
Phase 3 (Synthesis):  ~5s
Total: ~10-20s (vs ~2s for a single model)
Bad for:
- Real-time chat interfaces
- Interactive debugging tools
- Live autocomplete systems
- Time-sensitive alerts
Alternative: Use single fast model with caching.
2. Tasks are creative or subjective
No objective “right” answer to converge on:
Query: "Write a creative story about a dragon"
Model A: Fantasy adventure
Model B: Modern allegory
Model C: Historical fiction
Chairman: ??? (All are valid, no objective best)
Bad for:
- Creative writing
- UI design decisions
- Marketing copy
- Artistic choices
Alternative: Use single model, potentially with temperature sampling.
3. Budget is extremely constrained
Councils cost 3-15x more:
# Cost comparison (assuming $0.01 per 1K tokens)
Single model: $0.05 per query
Homogeneous council (3 models): $0.15 per query
Heterogeneous council (3 models): $0.35 per query
# At 10M queries/month:
Single model: $500K/month
Council: $3.5M/month
Bad for:
- High-volume low-value queries
- Exploratory prototypes
- Non-critical applications
- Startups with limited budgets
Alternative: Use councils only for critical queries, single model for rest.
4. Deterministic output required
Councils have inherent variability:
# Same query, run twice:
Run 1: Model A leads, final answer emphasizes security
Run 2: Model B leads, final answer emphasizes performance
# Both correct, but different focus → non-deterministic
Bad for:
- Automated pipelines requiring exact output format
- Testing frameworks
- Cached results systems
- Reproducible research
Alternative: Use single model with temperature=0.
5. Simple factual queries
Overkill for straightforward lookups:
Query: "What is the capital of France?"
Single model: "Paris" (correct, 2s, $0.01)
Council: "Paris" (correct, 20s, $0.35)
# 18s latency + $0.34 for no added value
Bad for:
- FAQ systems
- Simple data lookups
- Obvious questions
- Well-established facts
Alternative: Use knowledge base or single model.
Decision framework
def should_use_council(query: Query) -> bool:
"""
Decision tree for council vs single model
"""
# High-stakes? → Consider council
if query.stakes in ["critical", "high"]:
# Can afford latency?
if query.latency_tolerance > 10: # seconds
# Can afford cost?
if query.budget_multiplier >= 7:
return True
# Subjective or creative? → No council
if query.type in ["creative", "subjective", "opinion"]:
return False
# Simple fact? → No council
if query.complexity == "simple" and query.verifiable:
return False
# Default: Single model
return False
The rule: Use councils deliberately when the value justifies the cost, not as a default.
9. Cost & Latency Considerations
Cost analysis
Formula:
Total Cost = (N × Generation Cost) + (N × Critique Cost) + (Chairman Cost)
Where:
- N = number of council members
- Generation Cost = per-model inference cost for initial answer
- Critique Cost = per-model inference cost for reviewing others
- Chairman Cost = synthesis inference cost
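The formula translates directly into a small estimator, useful for budgeting before committing to a council. A sketch whose dollar inputs are the illustrative prices from the example below, not real provider pricing:
def estimate_council_cost(n_members: int,
                          generation_cost: float,
                          critique_cost: float,
                          chairman_cost: float) -> float:
    """Total cost = (N x generation) + (N x critique) + chairman synthesis."""
    return n_members * generation_cost + n_members * critique_cost + chairman_cost

# Matches the 3-member example below: 3 x $0.10 + 3 x $0.15 + $0.20 = $0.95 per query
per_query = estimate_council_cost(3, generation_cost=0.10, critique_cost=0.15, chairman_cost=0.20)
multiplier = per_query / 0.10  # ~9.5x a single-model call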
Typical configuration:
# 3-model heterogeneous council
members = [GPT4(), Claude3_5(), Gemini()]
chairman = GPT4()
# Cost breakdown (example pricing):
Generation: 3 models × $0.10 = $0.30
Critique: 3 models × $0.15 = $0.45 # Longer prompts (includes other answers)
Synthesis: 1 model × $0.20 = $0.20 # Longest prompt (all answers + critiques)
Total: $0.95 per query
# vs Single model: $0.10 per query
# Multiplier: 9.5x
Cost optimization strategies:
1. Homogeneous councils (cheaper):
# Same model, different temperatures
members = [GPT4(temp=0.3), GPT4(temp=0.7), GPT4(temp=1.0)]
Cost: 3-4x single model (vs 7-10x for heterogeneous)
2. Selective councils (smarter):
def route_query(query):
if query.risk_level == "critical":
return full_council.process(query) # 10x cost
elif query.risk_level == "medium":
return light_council.process(query) # 3x cost
else:
return single_model.process(query) # 1x cost
3. Cached critiques:
# Cache common critique patterns
if query.similar_to_previous():
reuse_critique_template() # Save critique phase cost
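One concrete version of the caching idea is to key full council results on a normalized query hash, so exact repeats skip the council entirely. A minimal exact-match sketch; semantic similarity matching would be a further refinement, not shown here:
import hashlib

_council_cache: dict = {}

def _cache_key(query: str) -> str:
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

async def cached_council_process(council, query: str):
    """Reuse a stored council result for repeated queries; otherwise run the full council."""
    key = _cache_key(query)
    if key in _council_cache:
        return _council_cache[key]  # Skips generation, critique, and synthesis cost
    result = await council.process(query)
    _council_cache[key] = result
    return result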
Latency analysis
Sequential vs parallel execution:
# Bad: Sequential execution
answer_a = await model_a.generate(query) # 2s
answer_b = await model_b.generate(query) # 2s
answer_c = await model_c.generate(query) # 2s
# Total: 6s
# Good: Parallel execution
answers = await asyncio.gather(
model_a.generate(query),
model_b.generate(query),
model_c.generate(query)
) # Total: 2s (limited by slowest model)
Latency breakdown:
# Optimized parallel council
Phase 1 (Generation): max(2s, 2s, 2s) = 2s # Parallel
Phase 2 (Critique): max(3s, 3s, 3s) = 3s # Parallel
Phase 3 (Synthesis): 5s # Sequential
Total: 10s
# vs Single model: 2s
# Multiplier: 5x
Latency optimization strategies:
1. Streaming synthesis:
# Start synthesis before all critiques complete
async for partial in chairman.stream_synthesis():
yield partial # Start returning results immediately
2. Tiered timeout:
# Don't wait for stragglers: keep whatever finished within the timeout
done, pending = await asyncio.wait(generate_tasks, timeout=5.0)
for task in pending:
    task.cancel()  # Proceed with 2 of 3 answers if one model is slow
answers = [task.result() for task in done]
3. Model selection:
# Use faster models for time-sensitive queries
if query.latency_critical:
members = [Claude3_Haiku(), GPT4oMini(), GeminiFlash()] # Fast models
else:
members = [Claude3_5Opus(), GPT4(), GeminiPro()] # Slow but accurate
10. Failure Modes (Important)
Common mistakes that break councils
1. Sharing context between generators
# Bad: Context leakage
answer_a = await model_a.generate(query)
answer_b = await model_b.generate(query, context=answer_a) # ❌ Contaminated
# Good: True independence
answers = await asyncio.gather(
model_a.generate(query),
model_b.generate(query)
) # ✓ Independent
Why this breaks councils: Models anchor on first answer, reducing diversity.
2. Letting chairman see model identities
# Bad: Brand bias
final = chairman.synthesize([
("GPT-4", answer_a), # ❌ Chairman may favor "GPT-4" brand
("Claude", answer_b),
("Gemini", answer_c)
])
# Good: Anonymous synthesis
final = chairman.synthesize([
("Response A", answer_a), # ✓ Evaluated on merit only
("Response B", answer_b),
("Response C", answer_c)
])
Why this breaks councils: Brand bias overrides content quality assessment.
3. Using same prompt everywhere
# Bad: Generic prompts
prompt = "Answer this: {query}"
# All models get identical prompt → similar reasoning paths
# Good: Role-specific prompts
prompts = {
"model_a": "You are a careful reasoner. Analyze: {query}",
"model_b": "You are a critical verifier. Evaluate: {query}",
"model_c": "You are a creative problem-solver. Address: {query}"
}
Why this breaks councils: Identical prompts reduce diversity of thought.
4. No explicit critique rubric
# Bad: Vague critique
critique_prompt = "Review the other answers"
# Models don't know what to look for
# Good: Structured rubric
critique_prompt = """
Review each answer for:
1. Factual errors (cite specific mistakes)
2. Logical flaws (explain reasoning gaps)
3. Missed considerations (what did they overlook?)
4. Severity (critical/major/minor)
Format your critique as:
- Target: [Which answer]
- Errors: [List of specific errors]
- Severity: [Assessment]
"""
Why this breaks councils: Unstructured critiques miss errors or are too vague.
5. Ignoring disagreement
# Bad: Mechanical voting
final = majority_vote(answers) # ❌ Loses minority insights
# Good: Investigate disagreement
final = chairman.analyze_disagreement(answers, critiques) # ✓ Understands why
Why this breaks councils: Minority correct answer gets discarded by voting.
The meta-failure: A bad council is worse than a single model
Why:
- False confidence (consensus on wrong answer)
- Increased cost with no benefit
- Longer latency for same result
- Debugging is harder (which model failed?)
Warning signs:
- Council always agrees (too little diversity)
- Council never agrees (too much noise)
- Final answer = first model’s answer (chairman not synthesizing)
- Critiques are generic praise (not adversarial)
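These warning signs can be checked on every run so a degenerate council is caught early. A sketch, assuming answers and critiques are available as plain strings; adapt it to the structured types used elsewhere in this guide:
from typing import Dict, List

def council_health_check(member_answers: List[str],
                         final_answer: str,
                         critiques: List[str]) -> Dict[str, bool]:
    """Flag the failure modes that make a council worse than a single model."""
    normalized = [a.strip().lower() for a in member_answers]
    return {
        # Too little diversity: every member produced the same answer
        "always_agrees": len(set(normalized)) == 1,
        # Too much noise: no two members agree at all
        "never_agrees": len(set(normalized)) == len(normalized) and len(normalized) > 2,
        # Chairman not synthesizing: final answer is just the first member's answer
        "echoes_first_answer": final_answer.strip().lower() == normalized[0],
        # Critiques read as generic praise: none names a concrete error
        "critiques_all_praise": not any(
            "error" in c.lower() or "incorrect" in c.lower() for c in critiques
        ),
    }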
11. Evaluation & Metrics
Key metrics to track
1. Disagreement rate
def disagreement_rate(answers: List[Answer]) -> float:
"""
Measure: How often do models produce different answers?
High disagreement (>50%) = Diverse perspectives (good)
Low disagreement (<10%) = Redundant models (bad)
"""
    # Compare answer content; 0.0 = all members identical, 1.0 = all members differ
    unique_answers = len({a.content for a in answers})
    return (unique_answers - 1) / (len(answers) - 1)
Target: 40-60% disagreement rate (healthy diversity)
2. Error detection rate
def error_detection_rate(council_result: CouncilResult, ground_truth: str) -> Dict:
"""
Measure: How many errors did critique phase catch?
"""
single_model_errors = count_errors(council_result.member_answers[0], ground_truth)
final_errors = count_errors(council_result.final_answer, ground_truth)
return {
"errors_caught": single_model_errors - final_errors,
"detection_rate": (single_model_errors - final_errors) / single_model_errors
}
Target: >60% error reduction from best single model
3. Chairman override frequency
def chairman_override_rate(results: List[CouncilResult]) -> float:
"""
Measure: How often does chairman pick minority answer?
"""
overrides = sum(
1 for result in results
if result.final_answer != result.majority_answer
)
return overrides / len(results)
Target: 15-30% override rate (chairman adds value, not just voting)
4. Confidence calibration
def confidence_calibration(results: List[CouncilResult]) -> Dict:
"""
Measure: Does high confidence correlate with correctness?
"""
high_confidence_correct = [
r for r in results
if r.confidence > 0.8 and r.is_correct
]
return {
"precision_at_high_confidence": len(high_confidence_correct) / count_high_confidence(results),
"recall_of_correct_answers": len(high_confidence_correct) / count_correct(results)
}
Target: >90% precision when confidence >80%
Production monitoring
class CouncilMetrics:
def __init__(self):
self.metrics = {
"total_queries": 0,
"avg_latency": 0,
"avg_cost": 0,
"disagreement_rate": 0,
"error_detection_rate": 0,
"chairman_overrides": 0,
"confidence_calibration": {}
}
def log_query(self, result: CouncilResult):
self.metrics["total_queries"] += 1
self.metrics["avg_latency"] = rolling_avg(result.latency)
self.metrics["avg_cost"] = rolling_avg(result.cost)
# ... update other metrics
def alert_if_degraded(self):
if self.metrics["disagreement_rate"] < 0.1:
alert("Council diversity too low - models too similar")
if self.metrics["error_detection_rate"] < 0.3:
alert("Council not catching errors - check critique quality")
Monitor: Real-time dashboards tracking these metrics
12. Security & Safety
Mandatory safeguards
1. Output sanitization
def sanitize_council_output(result: CouncilResult) -> str:
"""
Remove sensitive information before returning to user
"""
final = result.final_answer
# Remove any model-specific artifacts
final = remove_model_signatures(final)
# Remove internal reasoning if not needed
if not user_requested_explanation:
final = remove_internal_reasoning(final)
# Sanitize PII that might have been in training data
final = redact_pii(final)
return final
2. Prompt isolation
# Bad: User input directly in prompts
prompt = f"Analyze: {user_input}" # ❌ Prompt injection risk
# Good: Structured prompts with isolation
prompt = construct_prompt(
system="You are a council member",
user_input=escape_and_validate(user_input), # ✓ Sanitized
guardrails=COUNCIL_GUARDRAILS
)
3. Model identity abstraction
# Never expose which model said what
# Bad:
return {
"answer": final,
"sources": {
"gpt4": answer_a, # ❌ Exposes model identity
"claude": answer_b
}
}
# Good:
return {
"answer": final,
"confidence": 0.85,
"methodology": "Multi-model review" # ✓ Generic
}
4. Logging of critiques
# Log for debugging but never expose
logger.info({
"query_id": query_id,
"member_answers": answers, # Internal only
"critiques": critiques, # Internal only
"final_answer": final, # Returned to user
"confidence": confidence # Returned to user
})
# Audit trail for security reviews
audit_log.record(query_id, user_id, final_answer, timestamp)
5. Rate limiting
# Councils are expensive - prevent abuse
@rate_limit(max_requests=10, window="1h", resource="council")
async def handle_council_query(user_id: str, query: str):
return await council.process(query)
Never expose raw council transcripts
# Bad: Exposing internal deliberation
return {
"answer": final,
"transcript": { # ❌ Security risk
"model_a": "...",
"model_b": "...",
"critiques": "...",
"synthesis_reasoning": "..."
}
}
# Good: Clean output only
return {
"answer": final,
"confidence": 0.85,
"explanation": optional_high_level_summary() # If requested
}
Why: Internal deliberation may contain:
- Hallucinations that were caught (don’t want users to see)
- Model-specific artifacts
- Intermediate reasoning that’s not user-facing
- Critiques that could confuse users
13. Minimal Production Pseudocode
Complete implementation
from typing import List, Dict
import asyncio
class LLMCouncil:
def __init__(self, models: List[LLM], chairman: LLM):
self.models = models
self.chairman = chairman
async def process(self, query: str) -> CouncilResult:
# Phase 1: Independent generation (parallel)
answers = await asyncio.gather(*[
model.generate(query)
for model in self.models
])
# Phase 2: Cross-critique (parallel)
critique_tasks = []
for i, critic in enumerate(self.models):
# Each model critiques all others
others = [a for j, a in enumerate(answers) if j != i]
critique_tasks.append(critic.critique(others))
critiques = await asyncio.gather(*critique_tasks)
# Phase 3: Chairman synthesis (sequential)
final = await self.chairman.synthesize(
query=query,
answers=self._anonymize(answers),
critiques=critiques
)
return CouncilResult(
final_answer=final,
member_answers=answers,
critiques=critiques,
confidence=self._calculate_confidence(answers, critiques, final)
)
def _anonymize(self, answers: List[Answer]) -> List[Answer]:
"""Remove model identities"""
return [
Answer(content=a.content, label=f"Response {chr(65+i)}")
for i, a in enumerate(answers)
]
def _calculate_confidence(self, answers, critiques, final) -> float:
"""Derive confidence from consensus and critique severity"""
consensus = len([a for a in answers if similar(a, final)]) / len(answers)
critical_issues = count_critical_critiques(critiques)
return max(0.0, consensus - (critical_issues * 0.1))
# Usage
council = LLMCouncil(
models=[GPT4(), Claude3_5(), Gemini()],
chairman=GPT4()
)
result = await council.process("Review this code for security issues")
print(result.final_answer)
print(f"Confidence: {result.confidence:.1%}")
Simple structure — hard engineering:
- Parallel execution for performance
- Isolation for independence
- Anonymization for fairness
- Confidence scoring for trust
14. LLM Council vs Alternatives
Comparison matrix
| Approach | Reliability | Cost | Latency | Explainability | Use Case |
|---|---|---|---|---|---|
| Single LLM | Low (60-70% accuracy) | Low (1x) | Low (2s) | Low | General queries |
| Self-reflection | Medium (70-80% accuracy) | Medium (2-3x) | Medium (5s) | Medium | Iterative refinement |
| LLM Council | High (85-95% accuracy) | High (7-10x) | High (10-20s) | High | High-stakes decisions |
| Ensemble (voting) | Medium (75-85% accuracy) | Medium (3x) | Low (2s parallel) | Low | Classification tasks |
| RAG | Medium (context-dependent) | Low (1-2x) | Medium (5s) | Medium | Knowledge retrieval |
When to use each
Single LLM:
- Low stakes
- Budget constrained
- Latency critical
- Exploratory queries
Self-reflection:
- Moderate stakes
- Iterative improvement needed
- Single model sufficient but needs refinement
- Budget allows 2-3x cost
LLM Council:
- High stakes
- Low error tolerance
- Objective truth exists
- Budget allows 7-10x cost
- Latency tolerance >10s
Ensemble (voting):
- Classification tasks
- Multiple valid approaches
- Fast results needed
- Budget allows 3x cost
RAG:
- Knowledge retrieval
- Factual questions
- Document-based answers
- Budget constrained
Hybrid approaches
Council + RAG:
# Use RAG to ground council members in facts
answers = await asyncio.gather(*[
model.generate(query, context=rag_retrieve(query))
for model in council
])
Council for critical, single for routine:
if query.risk_level == "critical":
return await council.process(query) # 10x cost
else:
return await single_model.process(query) # 1x cost
15. Final Takeaway
LLM Councils are about governance, not intelligence
Key insights:
Councils don’t make models smarter
- They make systems more reliable
- Through structured disagreement and critique
- Not through better individual reasoning
They make systems harder to fool
- Hallucinations collapse under peer scrutiny
- Multiple independent checks catch fabrications
- Critique phase forces evidence-based reasoning
They make systems harder to break
- No single point of failure
- Diverse perspectives catch blind spots
- Adversarial review surfaces edge cases
They make systems easier to trust
- Explainable decision process
- Confidence calibration
- Traceable reasoning chains
- Validation through consensus
Use them deliberately
When councils are worth it:
- High-stakes decisions (legal, medical, financial, security)
- Low tolerance for errors (safety-critical systems)
- Objective truth exists (verifiable correctness)
- Budget supports 7-10x cost
- Latency tolerance allows 10-20s
When councils aren’t worth it:
- Low-stakes queries
- Creative or subjective tasks
- Budget constrained
- Latency critical (<1s)
- Simple factual lookups
The principle
Don’t use councils because they’re cool. Use them because the value of reliability justifies the cost of governance.
Production mindset:
- Start with single model
- Measure error rate and impact
- If errors are costly, try council on subset
- Measure improvement vs cost
- Scale deliberately where value is proven
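The "try council on a subset" step can be a simple canary router: send a sampled fraction of high-impact queries through the council, log both outputs, and compare error rates before scaling up. A sketch in which the 10% sample rate, the query.risk_level and query.text fields, and the log_comparison hook are all hypothetical assumptions:
import random

async def route_with_canary(query, single_model, council, canary_rate: float = 0.10):
    """Route a sampled fraction of high-risk queries through the council for comparison."""
    high_risk = getattr(query, "risk_level", "low") in {"high", "critical"}
    if high_risk and random.random() < canary_rate:
        council_result = await council.process(query.text)
        baseline = await single_model.generate(query.text)
        log_comparison(query, baseline, council_result)  # Hypothetical logging hook for offline analysis
        return council_result.final_answer
    return await single_model.generate(query.text)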
The goal: Build systems that are reliable enough to trust with decisions that matter.
Resources
Official Implementations
LLM Council frameworks:
- LangChain Multi-Agent Systems - Production-ready council orchestration
- AutoGen Multi-Agent Framework - Microsoft’s agent conversation framework
Research Papers
Foundational work:
- “Constitutional AI: Harmlessness from AI Feedback” - Anthropic’s work on AI critique
- “Self-Consistency Improves Chain of Thought Reasoning” - Multiple reasoning paths
- “Self-Refine: Iterative Refinement with Self-Feedback” - Self-critique mechanisms
Practical Guides
Implementations and tutorials:
- Andrej Karpathy’s LLM Council Explanation - Conceptual walkthrough
- Building Production LLM Systems - Chip Huyen’s comprehensive guide
Production Examples
Real-world council systems:
- Code review automation (GitHub Copilot-style)
- Legal document analysis (contract review)
- Medical literature synthesis (research assistants)
- Security vulnerability scanning (multi-model detection)