Intelligent Model Selection: Cost-Optimized AI That Learns
In previous posts, we discussed context engineering and agent routing. But there’s another critical challenge: choosing the right model for each task.
The Problem: Cloud AI Costs Add Up Fast
When building an AI coding assistant, one of the biggest challenges is cost management. Traditional approaches use fixed model assignments:
- Planning? Always use Model X
- Editing? Always use Model Y
- Review? Always use Model Z
This rigid approach has several problems:
- Ignores availability - What if Model X hits rate limits?
- Wastes money - Why use expensive models for simple tasks?
- No learning - Repeats mistakes with failing models
- Manual management - Users must constantly tweak configs
The Solution: Multi-Dimensional Model Selection
GPTCode now features an intelligent model selection system that automatically chooses the best model for each action based on:
1. Availability (Highest Priority)
Score penalties:
- 90%+ rate limit usage: -50 points
- Recent errors: -30 points
The system tracks daily usage per model and automatically switches to fallback providers when limits are reached.
2. Cost Optimization
- Free models (OpenRouter): no penalty
- $0.30/1M tokens: -9 points
- $3.00/1M tokens: -90 points
In other words, the penalty scales linearly at -30 points per $1 per 1M tokens.
OpenRouter’s free models (Gemini 2.0 Flash, Llama 3.2 3B, etc.) are prioritized, keeping costs near zero for most workloads.
3. Context Window
- 1M tokens: +100 bonus
- 128k tokens: +12.8 bonus
- 8k tokens: +0.8 bonus
Larger context windows handle complex refactorings better, so they get scoring bonuses.
4. Speed
- 150 tokens/sec: +7.5 bonus
- 50 tokens/sec: +2.5 bonus
Faster models improve developer experience with quick responses.
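Putting the four dimensions together, the selector reduces to a single scoring function. Here is a minimal Python sketch; the weights mirror the numbers above, but the function name and field names are illustrative, not GPTCode’s actual internals:

def score_model(m: dict, usage: dict) -> float:
    """Combine availability, cost, context window, and speed into one score."""
    score = 0.0

    # 1. Availability: heavy penalties dominate every other dimension
    if usage.get("requests", 0) >= 0.9 * m["rate_limit_daily"]:
        score -= 50  # 90%+ of the daily rate limit consumed
    if usage.get("last_error"):
        score -= 30  # recent errors

    # 2. Cost: -30 points per $1 per 1M tokens (free models lose nothing)
    score -= 30 * m["cost_per_1m"]

    # 3. Context window: +100 for 1M tokens, scaled linearly
    score += m["context_window"] / 10_000

    # 4. Speed: +7.5 for 150 tok/s, scaled linearly
    score += m["tokens_per_sec"] / 20

    return score

The candidate with the highest score wins; because the availability penalties are so large, a rate-limited or failing model effectively drops to the bottom of the ranking.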
How It Works
Automatic Usage Tracking
Every LLM call records:
- Backend and model used
- Success/failure status
- Error messages
- Token usage (input/output/cached)
Data stored in ~/.gptcode/usage.json:
{
  "2025-12-02": {
    "openrouter/gemini-2.0-flash-exp:free": {
      "requests": 47,
      "input_tokens": 125000,
      "output_tokens": 8500,
      "cached_tokens": 89000,
      "last_error": null
    }
  }
}
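Each LLM call updates this file in place. A minimal sketch of that bookkeeping (record_usage and its signature are illustrative, not GPTCode’s actual API):

import datetime
import json
from pathlib import Path

USAGE_PATH = Path.home() / ".gptcode" / "usage.json"

def record_usage(model_key: str, in_tok: int, out_tok: int,
                 cached_tok: int, error: str | None = None) -> None:
    """Add one LLM call to today's per-model counters."""
    data = json.loads(USAGE_PATH.read_text()) if USAGE_PATH.exists() else {}
    today = datetime.date.today().isoformat()
    entry = data.setdefault(today, {}).setdefault(model_key, {
        "requests": 0, "input_tokens": 0, "output_tokens": 0,
        "cached_tokens": 0, "last_error": None,
    })
    entry["requests"] += 1
    entry["input_tokens"] += in_tok
    entry["output_tokens"] += out_tok
    entry["cached_tokens"] += cached_tok
    entry["last_error"] = error  # None on success
    USAGE_PATH.write_text(json.dumps(data, indent=2))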
Elegant Stats Dashboard
gptcode stats
Output:
────────────────────────────────────────────────────────────
Usage Statistics
Period: All Time
Total Requests: 47
Success Rate: 100.0%
Token Usage
Input Tokens: 125.0k
Output Tokens: 8.5k
Cached Tokens: 89.0k (71.2% cache hit)
💡 Cache savings: 89.0k tokens, reducing costs
Model Usage
Model                             Requests   Status
────────────────────────────────────────────────────────
gemini-2.0-flash-exp:free               47   ✓
» Tip: Use 'gptcode stats --today' for today's activity
────────────────────────────────────────────────────────────
Mode Management
Simple switching between cloud and local execution:
gptcode mode # Show current mode
gptcode mode cloud # Use cloud providers (OpenRouter, Groq)
gptcode mode local # Use Ollama only
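Under the hood, the mode is just a filter on which backends enter the scoring pass. A sketch (the backend groupings here are assumptions based on the commands above):

# Mode narrows the pool of backends eligible for model scoring.
CLOUD_BACKENDS = ("openrouter", "groq")
LOCAL_BACKENDS = ("ollama",)

def candidate_backends(mode: str) -> tuple[str, ...]:
    return CLOUD_BACKENDS if mode == "cloud" else LOCAL_BACKENDS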
Real-World Example
Before (fixed profiles):
profiles:
  router: llama-3.3-70b-versatile    # $0.59/1M, 14k daily limit
  editor: llama-3.3-70b-versatile
  reviewer: llama-3.3-70b-versatile
Cost: ~$5-10/month, frequent rate limits
After (intelligent selection):
mode: cloud # That's it!
The system automatically:
- Tries Gemini 2.0 Flash (free, 1M context, 150 tok/s)
- Falls back to Llama 3.2 3B (free) if rate limited
- Uses Groq’s paid models only when the free tiers are exhausted
- Learns from failures and avoids problematic models
Cost: ~$0-2/month, zero rate limit issues
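The fallback chain is not hard-coded; it falls out of re-ranking by score. A sketch that reuses score_model from earlier (call_llm is a placeholder for the real provider call):

def run_with_fallback(models: list[dict], usage_by_model: dict, prompt: str):
    """Try models best-score-first; on failure, move on to the next candidate."""
    ranked = sorted(
        models,
        key=lambda m: score_model(m, usage_by_model.get(m["id"], {})),
        reverse=True,
    )
    last_err = None
    for m in ranked:
        try:
            return call_llm(m["id"], prompt)  # placeholder for the real API call
        except Exception as err:              # rate limit, timeout, provider error
            last_err = err                    # logged, so the next scoring pass penalizes it
    raise RuntimeError(f"all candidate models failed: {last_err}")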
The Model Catalog
~/.gptcode/models_catalog.json defines available models:
{
  "openrouter": {
    "models": [{
      "id": "google/gemini-2.0-flash-exp:free",
      "cost_per_1m": 0,
      "rate_limit_daily": 1000,
      "context_window": 1000000,
      "tokens_per_sec": 150,
      "capabilities": {
        "supports_tools": true,
        "supports_file_operations": true
      }
    }]
  }
}
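Pulling the zero-cost candidates out of this structure is a short traversal (a sketch; the path and field names follow the example above):

import json
from pathlib import Path

CATALOG_PATH = Path.home() / ".gptcode" / "models_catalog.json"

def free_models() -> list[str]:
    """List catalog models with zero per-token cost, across all backends."""
    catalog = json.loads(CATALOG_PATH.read_text())
    return [m["id"]
            for backend in catalog.values()
            for m in backend["models"]
            if m["cost_per_1m"] == 0]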
Enrich with new models:
python3 ml/scripts/enrich_catalog.py
ML-Powered Learning (Coming Soon)
The system already records feedback after each execution:
- Success: +20 score bonus
- Failure: -40 score penalty
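In scoring terms, this feedback is just one more additive term per model (a sketch; how the outcome history is stored is an assumption):

def feedback_bonus(outcomes: list[bool]) -> int:
    """+20 per recorded success, -40 per recorded failure."""
    return sum(20 if ok else -40 for ok in outcomes)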
After 20-30 tasks, train an ML classifier:
python3 ml/model_selection/train.py
The model learns patterns like:
- “For Go package refactoring, prefer Model A”
- “For Python simple edits, Model B works fine”
- “Model C fails on Elixir, avoid it”
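Since this feature is still in development, here is only a plausible shape for that training step: a hedged scikit-learn sketch in which the record fields, model names, and classifier choice are all assumptions:

# Hypothetical: predict success probability for (task, model) pairs.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

records = [  # assumed shape of the recorded feedback
    {"language": "go", "task": "refactor", "model": "model-a", "success": True},
    {"language": "python", "task": "edit", "model": "model-b", "success": True},
    {"language": "elixir", "task": "edit", "model": "model-c", "success": False},
]

vec = DictVectorizer(sparse=False)  # one-hot encodes the string features
X = vec.fit_transform([{k: r[k] for k in ("language", "task", "model")}
                       for r in records])
y = [r["success"] for r in records]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# At selection time, score each candidate model for the task at hand
# and prefer the one with the highest predicted probability of success.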
Summary
Intelligent model selection delivers:
- 10x cost reduction via free model prioritization
- Zero downtime with automatic fallback
- Better UX with speed-optimized choices
- Continuous learning from historical feedback
Try it today:
gptcode mode cloud
gptcode do "refactor auth module"
gptcode stats
Have questions about model selection? Join our GitHub Discussions
See Also
- Groq Optimal Configurations - Budget-friendly model setups
- OpenRouter Multi-Provider - Access to 200+ models
- ML-Powered Intelligence - ML capabilities in GPTCode