
Overview

Run powerful AI coding agents completely locally - no API keys, no cloud services, no data leaving your machine. Perfect for privacy-sensitive work, learning, or unlimited free usage.
Privacy First: All code stays on your machine. No external API calls.

Supported Models

OpenCode

Open-source coding model
  • By: Open-source community
  • Size: 15B parameters
  • Context: 16K tokens
  • Requirements: 16GB+ RAM, GPU recommended
  • License: Apache 2.0
Best for: General coding, privacy-sensitive work

Qwen Code

Alibaba’s open-source coder
  • By: Alibaba DAMO Academy
  • Size: 7B, 14B, 32B parameters
  • Context: 32K tokens
  • Requirements: 8GB-64GB RAM depending on size
  • License: Apache 2.0
Best for: Cost-conscious development; competitive with commercial models

Quick Start with Ollama

Ollama makes running local models easy:

1. Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows
# Download from https://ollama.ai

2. Start Ollama

ollama serve
The server listens on http://localhost:11434 by default.
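To verify the server is reachable before configuring Forge, you can query the local API (Ollama's /api/tags endpoint lists the models you have pulled):

curl http://localhost:11434/api/tags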

3. Pull Models

# OpenCode
ollama pull opencode

# Qwen Code (choose size)
ollama pull qwen2.5-coder:7b    # Small, fast
ollama pull qwen2.5-coder:14b   # Balanced
ollama pull qwen2.5-coder:32b   # Best quality

4. Configure Forge

Edit .forge/config.json:
{
  "llms": {
    "opencode": {
      "type": "ollama",
      "model": "opencode",
      "endpoint": "http://localhost:11434"
    },
    "qwen": {
      "type": "ollama",
      "model": "qwen2.5-coder:32b",
      "endpoint": "http://localhost:11434"
    }
  }
}

5. Test

forge task create \
  --title "Test local model" \
  --description "Print 'Hello from OpenCode!'" \
  --llm opencode

Hardware Requirements

Minimum Specs

| Model    | RAM  | GPU         | Storage | Speed  |
|----------|------|-------------|---------|--------|
| Qwen 7B  | 8GB  | Optional    | 5GB     | Good   |
| Qwen 14B | 16GB | Recommended | 10GB    | Better |
| OpenCode | 16GB | Recommended | 12GB    | Good   |
| Qwen 32B | 32GB | Required    | 25GB    | Best   |
For best experience:
CPU:  Modern multi-core (8+ cores)
RAM:  32GB+
GPU:  NVIDIA RTX 3060+ (12GB VRAM)
      or Apple Silicon M1/M2/M3

Storage: SSD with 50GB+ free space
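Not sure how much memory your machine has? A quick check (GPU checks are covered under Performance Optimization below):

free -h               # Linux: total and available RAM
sysctl hw.memsize     # macOS: total RAM in bytes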
Apple Silicon users: Metal acceleration makes M1/M2/M3 excellent for local models!

Configuration

Basic Ollama Setup

{
  "llms": {
    "opencode": {
      "type": "ollama",
      "model": "opencode",
      "endpoint": "http://localhost:11434"
    }
  }
}

Advanced Configuration

{
  "llms": {
    "qwen": {
      "type": "ollama",
      "model": "qwen2.5-coder:32b",
      "endpoint": "http://localhost:11434",
      "options": {
        "temperature": 0.7,
        "num_ctx": 32768,      // Context window
        "num_predict": 2048,   // Max output tokens
        "top_p": 0.9,
        "top_k": 40,
        "repeat_penalty": 1.1
      },
      "timeout": 120000  // 2 minutes
    }
  }
}

GPU Acceleration

Enable GPU support for faster inference:
{
  "llms": {
    "qwen": {
      "type": "ollama",
      "model": "qwen2.5-coder:32b",
      "options": {
        "num_gpu": 99,        // GPU layers to offload (99 = offload all layers)
        "main_gpu": 0,        // Primary GPU index
        "low_vram": false     // Set true if you have under 8GB VRAM
      }
    }
  }
}
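To confirm the model actually loaded onto the GPU, recent Ollama versions report placement per loaded model via ollama ps; on NVIDIA hardware, nvidia-smi is another quick check while a task runs:

# Shows each loaded model and whether it is running on GPU or CPU
ollama ps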

Strengths of Local Models

Complete Privacy

No Data Leakage

  • Code never leaves your machine
  • No API calls to external services
  • No telemetry or tracking
  • Perfect for sensitive codebases

Compliance-Ready

  • GDPR compliant by design
  • No third-party data sharing
  • Full audit trail
  • Meets enterprise security requirements
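If you need to demonstrate this in a security review, one simple check is that Ollama is listening only on the loopback interface (assuming you have not set OLLAMA_HOST to expose it on the network):

# Confirm the Ollama API is bound to localhost only
ss -tlnp | grep 11434            # Linux
lsof -iTCP:11434 -sTCP:LISTEN    # macOS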

No Usage Limits

# Unlimited usage!
forge task create "Task 1" --llm qwen
forge task create "Task 2" --llm qwen
forge task create "Task 3" --llm qwen
# ... as many as you want

# No rate limits
# No token costs
# No monthly fees

No Internet Required

Work anywhere:
  • ✈️ On airplanes
  • 🏔️ Remote locations
  • 🔌 During outages
  • 🔒 Air-gapped environments

Limitations

Lower Quality

Local models are less capable than cloud models:
| Task Type    | Local | Claude | GPT-4 |
|--------------|-------|--------|-------|
| Simple fixes | ⭐⭐⭐⭐  | ⭐⭐⭐⭐⭐  | ⭐⭐⭐⭐⭐ |
| Architecture | ⭐⭐    | ⭐⭐⭐⭐⭐  | ⭐⭐⭐⭐  |
| Bug fixing   | ⭐⭐⭐   | ⭐⭐⭐⭐⭐  | ⭐⭐⭐⭐  |
| Testing      | ⭐⭐⭐⭐  | ⭐⭐⭐⭐⭐  | ⭐⭐⭐⭐  |

Slower

# Speed comparison: "Add authentication"
Gemini Flash:  2m 15s
Claude Sonnet: 5m 22s
GPT-4 Turbo:   6m 01s
Qwen 32B:      12m 30s  (local)
OpenCode:      15m 45s  (local)

Hardware Intensive

  • Requires powerful machine
  • GPU strongly recommended
  • High RAM usage
  • Slower on CPU-only

Best Use Cases

Privacy-Sensitive Work

# Medical records processing
forge task create \
  --title "Process patient data" \
  --files "data/patients/*.csv" \
  --llm qwen  # Stays local!

Learning & Experimentation

# Unlimited free usage for learning
forge task create "Try approach 1" --llm opencode
forge task create "Try approach 2" --llm opencode
forge task create "Try approach 3" --llm opencode
# ... no cost!

Air-Gapped Environments

# Government, military, high-security environments
# No internet connection required
forge task create "Classified work" --llm qwen

Cost Reduction

# After initial hardware investment
# Zero ongoing costs
forge task create "Feature 1" --llm qwen  # $0
forge task create "Feature 2" --llm qwen  # $0
forge task create "Feature 3" --llm qwen  # $0

Model Comparison

OpenCode vs Qwen

| Feature  | OpenCode | Qwen 7B     | Qwen 14B | Qwen 32B |
|----------|----------|-------------|----------|----------|
| Quality  | ⭐⭐⭐      | ⭐⭐          | ⭐⭐⭐      | ⭐⭐⭐⭐     |
| Speed    | ⭐        | ⭐⭐⭐⭐⭐       | ⭐⭐⭐      | ⭐⭐       |
| RAM      | 16GB     | 8GB         | 16GB     | 32GB     |
| Best for | General  | Quick tasks | Balanced | Quality  |

Local vs Cloud

| Feature | Local (Qwen 32B)       | Claude Sonnet          | Gemini Pro            |
|---------|------------------------|------------------------|-----------------------|
| Privacy | ⭐⭐⭐⭐⭐                  | ⭐⭐                     | ⭐⭐                    |
| Quality | ⭐⭐⭐⭐                   | ⭐⭐⭐⭐⭐                  | ⭐⭐⭐⭐                  |
| Speed   | ⭐⭐                     | ⭐⭐⭐⭐                   | ⭐⭐⭐⭐⭐                 |
| Cost    | $$$ hardware, $0 usage | $0 hardware, $$$ usage | $0 hardware, $$ usage |

Performance Optimization

Use GPU

# Check GPU usage
nvidia-smi

# Or on Mac
system_profiler SPDisplaysDataType

Adjust Context Window

{
  "llms": {
    "qwen": {
      "options": {
        "num_ctx": 16384  // Reduce for speed
      }
    }
  }
}

Use Smaller Models for Simple Tasks

# Simple task → Qwen 7B (fast)
forge task create "Add comment" --llm qwen-7b

# Complex task → Qwen 32B (quality)
forge task create "Refactor architecture" --llm qwen-32b
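This assumes both sizes are registered as separate entries in .forge/config.json, along these lines:

{
  "llms": {
    "qwen-7b": {
      "type": "ollama",
      "model": "qwen2.5-coder:7b",
      "endpoint": "http://localhost:11434"
    },
    "qwen-32b": {
      "type": "ollama",
      "model": "qwen2.5-coder:32b",
      "endpoint": "http://localhost:11434"
    }
  }
}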

Troubleshooting

Error: “Connection refused to localhost:11434”
Solution:
# Start Ollama
ollama serve

# Or as background service (Linux)
systemctl start ollama

# macOS (via brew services)
brew services start ollama
Error: “Failed to allocate memory”
Solutions:
  • Use smaller model (7B instead of 32B)
  • Close other applications
  • Enable low VRAM mode:
    {
      "llms": {
        "qwen": {
          "options": {
            "low_vram": true
          }
        }
      }
    }
    
  • Upgrade RAM
Issue: Model taking forever
Solutions:
  • Enable GPU acceleration
  • Use smaller model
  • Reduce context window
  • Close background apps
  • Check CPU usage (should be high)
Error: “Model ‘qwen2.5-coder:32b’ not found”
Solution:
# Pull the model
ollama pull qwen2.5-coder:32b

# List available models
ollama list

# Check model size before pulling
ollama show qwen2.5-coder:32b

Cost Analysis

Hardware Investment

One-time costs:
├── Mid-range GPU (RTX 4060): $300
├── Additional RAM (32GB): $100
└── SSD storage (1TB): $80
────────────────────────────
Total: ~$480

Break-even vs Claude:
$480 / $50/month = ~10 months of usage

Ongoing Costs

Electricity:
├── GPU power: ~200W
├── Running 8hrs/day
└── Cost: ~$5-10/month
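As a rough sanity check on that figure (electricity prices vary): 0.2 kW × 8 h/day × 30 days ≈ 48 kWh/month, which at $0.12-0.20/kWh is roughly $6-10/month.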

vs Cloud APIs:
├── Claude: $50-200/month typical usage
├── GPT-4: $80-300/month
└── Gemini: $0-50/month (free tier)
Heavy users (>$100/month on APIs) break even quickly with a local setup!

Best Practices

Use for Sensitive Work

# Proprietary code
forge task create \
  --title "Refactor proprietary algorithm" \
  --llm qwen  # Stays local

Start Small

Begin with Qwen 7B:
ollama pull qwen2.5-coder:7b
Upgrade to 32B if needed
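The upgrade is just another pull; removing the 7B weights afterwards is optional and only reclaims disk space. Remember to update the "model" field in .forge/config.json:

ollama pull qwen2.5-coder:32b
ollama rm qwen2.5-coder:7b    # optional: frees roughly 5GB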

Monitor Resources

# Check resource usage
htop
nvidia-smi  # GPU

# Adjust if needed

Combine with Cloud

# Quick iteration: local
forge task create "Try A" --llm qwen

# Final polish: cloud
forge task fork 1 --llm claude
Best of both worlds!

Hybrid Strategy

Combine local and cloud models:

Strategy 1: Privacy Tiers

# Sensitive code → local
forge task create \
  --title "Process customer data" \
  --llm qwen

# Public code → cloud (faster, better)
forge task create \
  --title "Update README" \
  --llm gemini

Strategy 2: Cost Optimization

# Experimentation → local (free)
forge task create "Try approach A" --llm qwen
forge task create "Try approach B" --llm qwen
forge task create "Try approach C" --llm qwen

# Production → cloud (quality)
forge task fork <winner> --llm claude

Strategy 3: Network-Aware

# Pick the model based on connectivity
if ping -c 1 8.8.8.8 >/dev/null 2>&1; then
  LLM=gemini  # Online: use the fast cloud model
else
  LLM=qwen    # Offline: fall back to local
fi

forge task create "Task" --llm "$LLM"

Real-World Example

Setup for Privacy-First Development

# 1. Install Ollama
brew install ollama

# 2. Pull Qwen 32B (best quality)
ollama pull qwen2.5-coder:32b

# 3. Configure Forge
cat > .forge/config.json <<EOF
{
  "llms": {
    "qwen": {
      "type": "ollama",
      "model": "qwen2.5-coder:32b",
      "endpoint": "http://localhost:11434",
      "options": {
        "num_gpu": 1,
        "num_ctx": 32768
      }
    }
  }
}
EOF

# 4. Start building (all local!)
forge task create "Sensitive feature" --llm qwen

Next Steps