
Overview

One of Forge’s most powerful features is comparing results from multiple AI agents side-by-side: see which approach works best, cherry-pick the best parts, or combine solutions.

Why Compare?

Different AI agents have different strengths:
Agent          | Strength                     | Weakness
-------------- | ---------------------------- | -------------------------
Claude Sonnet  | Complex logic, architecture  | Can be verbose
Gemini Flash   | Fast, concise                | May miss edge cases
Cursor         | UI/UX intuition              | Less depth on algorithms
GPT-4          | Comprehensive                | Expensive
Solution: Try multiple, compare results, choose the best!

Quick Comparison

Via CLI

# Compare all attempts for a task
forge task compare 1

# Compare specific attempts
forge diff task-1-claude task-1-gemini

# Show only statistics
forge task compare 1 --stats

Via Web UI

  1. Open Task: Click on the task card in the Kanban board
  2. View Attempts Tab: Click the Attempts tab to see all attempts
  3. Select Attempts to Compare: Check the boxes next to 2+ attempts
  4. Click 'Compare': Opens the split-view comparison interface

Comparison Views

File-by-File Diff

forge diff task-1-claude task-1-gemini --files
Output:
src/auth/login.ts
  Claude:  145 lines (+132, -13)
  Gemini:   98 lines (+89, -9)

src/auth/signup.ts
  Claude:  178 lines (+165, -13)
  Gemini:  142 lines (+135, -7)

src/utils/jwt.ts
  Claude:   67 lines (+67, new file)
  Gemini:   45 lines (+45, new file)

Side-by-Side Code Comparison

forge diff task-1-claude task-1-gemini \
  --file src/auth/login.ts \
  --side-by-side
Output:
Claude                                  │ Gemini
────────────────────────────────────────┼───────────────────────────────────────
import bcrypt from 'bcryptjs';         │ import bcrypt from 'bcrypt';
import jwt from 'jsonwebtoken';        │ import jwt from 'jsonwebtoken';
import { validateEmail } from './util';│

export async function login(req, res) {│ export const login = async (req, res) => {
  const { email, password } = req.body;│   const { email, password } = req.body;

  if (!validateEmail(email)) {         │   // Find user
    return res.status(400).json({      │   const user = await User.findOne({ email });
      error: 'Invalid email format'    │
    });                                │
  }                                    │

  const user = await User.findOne(...  │   if (!user || !await bcrypt.compare(...

Statistics Comparison

forge task compare 1 --stats
Output:
╭───────────┬─────────┬─────────┬─────────╮
│ Metric    │ Claude  │ Gemini  │ Cursor  │
├───────────┼─────────┼─────────┼─────────┤
│ Duration  │ 5m 22s  │ 2m 15s  │ 4m 01s  │
│ Files     │ 8       │ 6       │ 7       │
│ Lines +   │ 364     │ 269     │ 312     │
│ Lines -   │ 33      │ 16      │ 28      │
│ Tests     │ ✅ 24   │ ⚠️  18  │ ✅ 22   │
│ Cost      │ $0.234  │ $0.089  │ $0.187  │
╰───────────┴─────────┴─────────┴─────────╯

Detailed Comparison

Test Results

forge task compare 1 --tests
Output:
Test Coverage Comparison:

Claude (95% coverage):
  ✅ Login with valid credentials
  ✅ Login with invalid email
  ✅ Login with wrong password
  ✅ Login with missing fields
  ✅ JWT token generation
  ✅ Token expiry handling
  ⚠️  Missing: Rate limiting tests

Gemini (78% coverage):
  ✅ Login with valid credentials
  ✅ Login with wrong password
  ⚠️  Missing: Email validation tests
  ⚠️  Missing: Edge case handling
  ⚠️  Missing: Token expiry tests

Winner: Claude (more comprehensive)

Code Quality Metrics

forge task compare 1 --quality
Output:
Code Quality Analysis:

Claude:
  Complexity: Medium (Cyclomatic: 8)
  Maintainability: A
  TypeScript coverage: 100%
  Comments: Comprehensive
  Error handling: Excellent

Gemini:
  Complexity: Low (Cyclomatic: 4)
  Maintainability: A+
  TypeScript coverage: 100%
  Comments: Minimal
  Error handling: Basic

Winner: Tie (trade-off: Claude more robust, Gemini cleaner)

Performance Analysis

forge task compare 1 --benchmark
Output:
Performance Benchmarks:

Login Endpoint (1000 requests):
  Claude:  avg 45ms, p95 78ms, p99 125ms
  Gemini:  avg 38ms, p95 65ms, p99 98ms
  Cursor:  avg 42ms, p95 72ms, p99 110ms

Winner: Gemini (fastest)

Memory Usage:
  Claude:  ~120MB
  Gemini:  ~95MB
  Cursor:  ~108MB

Winner: Gemini (most efficient)

Visual Comparison (Web UI)

The Forge UI provides rich visual comparisons:

Split-Screen Editor

  • Left pane: Attempt 1 code
  • Right pane: Attempt 2 code
  • Synchronized scrolling
  • Inline diff highlighting

Architecture Diagram

Auto-generated comparison:
Claude's Approach:          Gemini's Approach:
┌────────────────┐         ┌────────────────┐
│  Controller    │         │  Handler       │
└───────┬────────┘         └───────┬────────┘
        │                          │
        ▼                          ▼
┌────────────────┐         ┌────────────────┐
│  Validator     │         │  User Service  │
└───────┬────────┘         └───────┬────────┘
        │                          │
        ▼                          ▼
┌────────────────┐         ┌────────────────┐
│  Service       │         │  Database      │
└───────┬────────┘         └────────────────┘
        │
        ▼
┌────────────────┐
│  Repository    │
└────────────────┘

Claude: More layers, separation   Gemini: Simpler, direct
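
To make the structural difference concrete, here is a hypothetical TypeScript sketch of the two shapes. The class and function names are illustrative, not taken from either attempt:

// Layered shape (Claude-style): each concern lives in its own unit.
class LoginValidator {
  isValid(email: string): boolean {
    return /@/.test(email);
  }
}

class UserRepository {
  async findByEmail(email: string) {
    // Database lookup would go here.
    return { email, passwordHash: '...' };
  }
}

class AuthService {
  constructor(
    private validator = new LoginValidator(),
    private repo = new UserRepository(),
  ) {}

  async login(email: string, password: string) {
    if (!this.validator.isValid(email)) throw new Error('Invalid email');
    return this.repo.findByEmail(email); // plus password check and token issuing
  }
}

// Direct shape (Gemini-style): one handler talks straight to the data layer.
async function loginHandler(email: string, password: string) {
  const user = await new UserRepository().findByEmail(email);
  return user; // plus password check and token issuing
}

The trade-off mirrors the diagram: the layered shape is easier to extend and test in isolation, while the direct shape has less indirection to read through.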

Decision Matrix

Build a decision matrix to choose systematically:
forge task compare 1 --matrix
Output:
╭────────────────┬─────────┬─────────┬─────────╮
│ Criteria       │ Claude  │ Gemini  │ Cursor  │
├────────────────┼─────────┼─────────┼─────────┤
│ Correctness    │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐⭐   │ ⭐⭐⭐⭐  │
│ Performance    │ ⭐⭐⭐⭐   │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐⭐  │
│ Maintainability│ ⭐⭐⭐⭐   │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐⭐  │
│ Test Coverage  │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐    │ ⭐⭐⭐⭐  │
│ Documentation  │ ⭐⭐⭐⭐⭐ │ ⭐⭐     │ ⭐⭐⭐   │
│ Cost           │ ⭐⭐⭐    │ ⭐⭐⭐⭐⭐ │ ⭐⭐⭐⭐  │
├────────────────┼─────────┼─────────┼─────────┤
│ Total Score    │ 24/30   │ 23/30   │ 22/30   │
╰────────────────┴─────────┴─────────┴─────────╯

Recommendation: Claude (best overall balance)
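
If your project values some criteria more than others (say, correctness over cost), you can recompute the ranking with weights. A minimal TypeScript sketch; the ratings and weights below are illustrative, on the same 1–5 scale as the matrix:

// Illustrative weights: tune these to your project's priorities.
const weights = { correctness: 3, performance: 2, maintainability: 2, tests: 2, docs: 1, cost: 1 };

const scores: Record<string, Record<keyof typeof weights, number>> = {
  claude: { correctness: 5, performance: 4, maintainability: 4, tests: 5, docs: 5, cost: 3 },
  gemini: { correctness: 4, performance: 5, maintainability: 5, tests: 3, docs: 2, cost: 5 },
  cursor: { correctness: 4, performance: 4, maintainability: 4, tests: 4, docs: 3, cost: 4 },
};

// Weighted total per agent.
for (const [agent, rating] of Object.entries(scores)) {
  const total = (Object.keys(weights) as (keyof typeof weights)[])
    .reduce((sum, key) => sum + rating[key] * weights[key], 0);
  console.log(`${agent}: ${total}`);
}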

Cherry-Picking Best Parts

Sometimes you want to combine approaches:

Manual Cherry-Pick

# Start with Claude's attempt
git checkout task-1-claude

# Cherry-pick specific file from Gemini
git checkout task-1-gemini -- src/utils/jwt.ts

# Cherry-pick specific changes as a patch
git diff task-1-claude task-1-gemini -- src/auth/login.ts | git apply

Via Web UI

  1. Open comparison view
  2. Select blocks of code from different attempts
  3. Click “Create Combined Attempt”
  4. Forge creates new attempt merging selected parts

Common Comparison Scenarios

Feature Implementation

Question: Which agent implemented the feature most completely?
# Check if all requirements met
forge task compare 1 --requirements-checklist

# Output:
# Claude:  ✅ All 8 requirements met
# Gemini:  ⚠️  6/8 requirements met (missing: rate limiting, password reset)
# Cursor:  ⚠️  7/8 requirements met (missing: email verification)

Bug Fix

Question: Which fix actually solves the bug without breaking anything?
# Run tests for each attempt
forge task compare 1 --run-tests

# Output:
# Claude:  ✅ All tests pass (24/24)
# Gemini:  ⚠️  2 tests fail (22/24)
# Cursor:  ✅ All tests pass (24/24)

# Check if bug is fixed
forge task compare 1 --validate-fix

# Output:
# Claude:  ✅ Bug fixed, no regressions
# Cursor:  ✅ Bug fixed, no regressions
# Gemini:  ❌ Bug still present in edge case

Refactoring

Question: Which refactor improves code without changing behavior?
# Verify behavior unchanged
forge task compare 1 --behavior-test

# Output:
# Claude:  ✅ All E2E tests pass
# Gemini:  ✅ All E2E tests pass
# Cursor:  ⚠️  1 E2E test fails (regression)

# Check code quality improvement
forge task compare 1 --complexity

# Output:
# Original: Cyclomatic complexity 15
# Claude:   Cyclomatic complexity 8  (-47%)
# Gemini:   Cyclomatic complexity 6  (-60%) ← Winner!
# Cursor:   Cyclomatic complexity 9  (-40%)

Export Comparison Reports

Generate Report

# Full comparison report
forge task compare 1 --report --output comparison-report.md

# HTML report with charts
forge task compare 1 --report --format html --output report.html

# JSON for programmatic use
forge task compare 1 --report --format json --output report.json
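
A short script can then post-process the JSON report. This is a minimal sketch; the attempts, agent, and stats fields are assumptions about the report shape rather than a documented schema, so inspect report.json first:

import { readFileSync } from 'fs';

// Load the exported report and print a one-line summary per attempt.
const report = JSON.parse(readFileSync('report.json', 'utf8'));

for (const attempt of report.attempts ?? []) {
  // Field names here are guesses; adjust to the actual report structure.
  console.log(attempt.agent, JSON.stringify(attempt.stats));
}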

Share with Team

# Upload to GitHub Gist
forge task compare 1 --share --gist

# Output:
# https://gist.github.com/user/abc123

# Or create PR description
forge task compare 1 --pr-description > pr-description.md

Best Practices

Test All Attempts

Don’t just read the code; run the tests:
forge task test-all 1
Code that looks good but fails tests is useless

Check Edge Cases

Specifically test edge cases:
# Test with invalid inputs
# Test with empty data
# Test error handling
Simple cases are easy; edges matter
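
As a concrete instance of that checklist, here is a minimal edge-case test sketch, assuming a Jest + supertest setup; the app import and route are illustrative:

import request from 'supertest';
import { app } from '../src/app'; // hypothetical app entry point

describe('POST /login edge cases', () => {
  it('rejects a malformed email', async () => {
    const res = await request(app)
      .post('/login')
      .send({ email: 'not-an-email', password: 'x' });
    expect(res.status).toBe(400);
  });

  it('rejects an empty body', async () => {
    const res = await request(app).post('/login').send({});
    expect(res.status).toBe(400);
  });
});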

Consider Maintainability

Today: all attempts work
Six months from now: which will you still understand?
Prefer clear code over clever code

Document Your Choice

forge task annotate 1 \
  --chosen claude \
  --reason "Better test coverage and clearer error handling"
Future you will thank you

Advanced: A/B Testing in Production

For critical features, deploy multiple attempts for A/B testing:
// Feature flag routing
const implementation = featureFlags.get('auth-impl');

if (implementation === 'claude') {
  return claudeAuth(req, res);
} else if (implementation === 'gemini') {
  return geminiAuth(req, res);
}
Monitor metrics:
  • Response times
  • Error rates
  • User feedback
Choose the winner based on real-world data!
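
A minimal, framework-agnostic sketch of recording per-implementation metrics; the timed helper and label names are illustrative, and in practice you would feed your existing metrics stack:

// Hypothetical in-memory tally keyed by implementation name.
const metrics: Record<string, { count: number; errors: number; totalMs: number }> = {};

async function timed(impl: string, fn: () => Promise<unknown>) {
  const entry = (metrics[impl] ??= { count: 0, errors: 0, totalMs: 0 });
  const start = Date.now();
  try {
    return await fn();
  } catch (err) {
    entry.errors++;
    throw err;
  } finally {
    entry.count++;
    entry.totalMs += Date.now() - start;
  }
}

// Usage: await timed('claude', () => claudeAuth(req, res));
// Periodically compare average latency and error rate per implementation.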

Troubleshooting

Issue: Comparing large attempts is slow
Solutions:
  • Use --files-only to skip detailed diffs
  • Compare specific files: --file src/auth/login.ts
  • Increase timeout: --timeout 300
Issue: Attempts look identical
Solutions:
  • Check if comparing same attempt twice
  • Use --ignore-whitespace to see real changes
  • Try --context 10 for more surrounding lines
Issue: Can’t run tests in worktrees
Solutions:
  • Ensure dependencies installed in each worktree
  • Check test paths are correct
  • Use --setup-cmd "npm install" first

Next Steps