If you’ve been building with LLMs for more than a few months, you’ve lived through this: you architect your product around Model A, get it production‑ready, and then Model B drops with better reasoning, a larger context window, and cheaper tokens. Two weeks later, another provider ships a breaking upgrade. Pricing changes. Latency improves. Your carefully tuned eval suite is suddenly stale.
Welcome to the fastest‑moving infrastructure layer in software history. The teams that survive treat models as swappable infrastructure — not hard dependencies. In other words: they build LLM‑agnostic.
The problem: you’re building on quicksand
- Capabilities double every 6–9 months. What needed few‑shot last quarter often works zero‑shot now.
- Unit economics swing wildly. Token prices and throughput change overnight; cost curves reset.
- Context windows explode. Retrieval and chunking strategies that worked at 8K break at 128K–2M.
- Providers ship fast and break assumptions. If you’re tied to one SDK, you inherit their churn.
Pattern #1 — Adapter layer: your abstraction is your moat
Never call provider SDKs directly from business logic. Wrap every call behind a thin adapter that standardizes inputs, outputs, errors, and metrics. Then your app code depends on your interface — not OpenAI, Anthropic, or Google.
Adapter interface
// lib/llm/adapter.ts
export interface LLMAdapter {
  generate(params: GenerateParams): Promise<GenerateResponse>;
  stream?(params: GenerateParams): AsyncIterable<StreamChunk>;
}

export interface Message {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
  name?: string;
}

export interface Tool {
  name: string;
  description?: string;
  jsonSchema?: Record<string, unknown>;
}

export interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

export interface GenerateParams {
  messages: Message[];
  model?: string;
  temperature?: number;
  maxTokens?: number;
  tools?: Tool[];
}

export interface GenerateResponse {
  content: string;
  toolCalls?: ToolCall[];
  usage: { promptTokens: number; completionTokens: number; totalCost: number };
  latency: number;
  model: string;
}
Provider-specific implementations (sketch)
// lib/llm/adapters/openai.ts
export class OpenAIAdapter implements LLMAdapter {
  constructor(private client: any) {}

  async generate(params: GenerateParams): Promise<GenerateResponse> {
    const res = await this.client.chat.completions.create({
      model: params.model || 'gpt-4o',
      messages: params.messages,
      temperature: params.temperature,
      max_tokens: params.maxTokens,
      tools: params.tools ? convertTools(params.tools) : undefined
    });
    return normalizeOpenAI(res);
  }
}
// lib/llm/adapters/anthropic.ts
export class AnthropicAdapter implements LLMAdapter {
  constructor(private client: any) {}

  async generate(params: GenerateParams): Promise<GenerateResponse> {
    // Anthropic takes the system prompt as a top-level field (not a message)
    // and requires max_tokens, so normalize both here.
    const system = params.messages.find(m => m.role === 'system')?.content;
    const res = await this.client.messages.create({
      model: params.model || 'claude-3-5-sonnet-20241022',
      system,
      messages: params.messages.filter(m => m.role !== 'system'),
      temperature: params.temperature,
      max_tokens: params.maxTokens ?? 1024,
      tools: params.tools ? convertTools(params.tools) : undefined
    });
    return normalizeAnthropic(res);
  }
}
Application code stays provider-agnostic
// app/services/summarizer.ts
import { getLLM } from '@/lib/llm';

export async function summarizeDocument(text: string) {
  const llm = getLLM(); // returns configured adapter
  const res = await llm.generate({
    messages: [
      { role: 'system', content: 'You are a precise summarization assistant.' },
      { role: 'user', content: 'Summarize:\n\n' + text }
    ],
    temperature: 0.3,
    maxTokens: 500
  });
  return res.content;
}
Why it matters: Swap providers via config, A/B test models in prod, route by latency/cost, and fall back on outages — without touching business logic.
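The getLLM() call above is left as "returns configured adapter." One minimal sketch of that factory, assuming a provider registry and a configurable default (the names registerAdapter and setDefaultProvider are hypothetical, not from any SDK):

```typescript
// lib/llm/index.ts — hypothetical config-driven factory.
interface LLMAdapter {
  generate(params: { messages: { role: string; content: string }[] }): Promise<{ content: string }>;
}

// Registry of adapter factories, keyed by provider name.
const registry = new Map<string, () => LLMAdapter>();
let defaultProvider = 'openai';

export function registerAdapter(name: string, factory: () => LLMAdapter) {
  registry.set(name, factory);
}

// Typically set once at startup from config or an env var.
export function setDefaultProvider(name: string) {
  defaultProvider = name;
}

// Business logic calls getLLM() and never names a vendor.
export function getLLM(provider = defaultProvider): LLMAdapter {
  const factory = registry.get(provider);
  if (!factory) throw new Error(`No adapter registered for "${provider}"`);
  return factory();
}
```

Swapping providers then really is a one-line config change: flip the default, and every call site follows.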
Pattern #2 — Routing layer: right model, right task, right time
Not every task deserves your frontier model. Route by task complexity, latency needs, context size, and budget.
// lib/llm/router.ts
type Task = {
  type: 'classification' | 'extraction' | 'reasoning' | 'chat';
  requiresTools?: boolean;
  classes?: string[];
};

export class LLMRouter {
  constructor(private getAdapter: (model: string) => LLMAdapter) {}

  async route(task: Task, params: GenerateParams) {
    if (task.type === 'classification' && (task.classes?.length || 0) <= 10) {
      return this.getAdapter('gpt-4o-mini'); // cheap & fast
    }
    if (task.requiresTools || task.type === 'reasoning') {
      return this.getAdapter('claude-3-5-sonnet-20241022'); // strong reasoning
    }
    if ((params as any).longContext) {
      return this.getAdapter('gemini-1.5-pro'); // huge window
    }
    return this.getAdapter('gpt-4-turbo'); // balanced default
  }
}
Real-world result: We cut costs ~60% and doubled throughput by routing simple tasks to cheap models, reserving frontier models for hard stuff, and using long‑context models only when needed.
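The routing rules above can also be factored into a pure function, which keeps the policy testable without mocking adapters; a sketch under the same model choices (longContext moved onto the task type rather than cast off params):

```typescript
// Pure routing policy: same rules as LLMRouter.route, minus the adapter lookup.
type RouteTask = {
  type: 'classification' | 'extraction' | 'reasoning' | 'chat';
  requiresTools?: boolean;
  classes?: string[];
  longContext?: boolean;
};

export function pickModel(task: RouteTask): string {
  if (task.type === 'classification' && (task.classes?.length ?? 0) <= 10) {
    return 'gpt-4o-mini'; // cheap & fast
  }
  if (task.requiresTools || task.type === 'reasoning') {
    return 'claude-3-5-sonnet-20241022'; // strong reasoning
  }
  if (task.longContext) {
    return 'gemini-1.5-pro'; // huge window
  }
  return 'gpt-4-turbo'; // balanced default
}
```

Because the policy is a pure function of the task, unit tests can pin each branch: small classification tasks route cheap, reasoning routes to the frontier model, everything else falls through to the default.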
Pattern #3 — Evaluation harness: trust, but verify
When a new model drops, you need to know fast if it’s better for your use case. Build continuous evals that score quality, cost, and latency. Run weekly; gate rollouts on data.
// lib/evals/harness.ts
export interface EvalCase {
  id: string;
  input: GenerateParams;
  expected: { contains?: string[]; format?: 'json' | 'markdown' | 'code'; minQuality?: number };
}

export class EvalHarness {
  constructor(private llm: LLMAdapter) {}

  async runEvals(model: string, cases: EvalCase[]) {
    const results = await Promise.all(cases.map(async (c) => {
      const t0 = Date.now();
      const r = await this.llm.generate({ ...c.input, model });
      return {
        id: c.id,
        passed: this.evaluate(r, c.expected),
        latency: Date.now() - t0,
        cost: r.usage.totalCost,
        output: r.content
      };
    }));
    return this.aggregate(results, model);
  }

  private evaluate(r: GenerateResponse, expected: EvalCase['expected']): boolean {
    // Minimal check: every expected substring appears in the output.
    // Add format and quality scoring as your use case demands.
    return (expected.contains ?? []).every(s => r.content.includes(s));
  }

  private aggregate(rows: { passed: boolean; latency: number; cost: number }[], model: string) {
    const passRate = rows.filter(r => r.passed).length / rows.length;
    const avgLatency = rows.reduce((a, b) => a + b.latency, 0) / rows.length;
    const avgCost = rows.reduce((a, b) => a + b.cost, 0) / rows.length;
    return { model, passRate, avgLatency, avgCost };
  }
}
// scripts/compare-models.ts (example output)
┌────────────────────────────┬──────────┬────────────┬─────────┐
│ model                      │ passRate │ avgLatency │ avgCost │
├────────────────────────────┼──────────┼────────────┼─────────┤
│ gpt-4-turbo                │ 0.94     │ 2.3s       │ $0.042  │
│ gpt-4o                     │ 0.96     │ 1.1s       │ $0.021  │
│ claude-3-5-sonnet-20241022 │ 0.98     │ 1.8s       │ $0.035  │
│ gemini-1.5-pro             │ 0.92     │ 3.1s       │ $0.018  │
└────────────────────────────┴──────────┴────────────┴─────────┘
Pattern #4 — Feature flags: deploy models like code
Ship model changes behind feature flags with gradual rollouts — by percentage, cohort, or context constraints.
// lib/llm/config.ts
export const modelConfig = {
  summarization: {
    default: 'gpt-4-turbo',
    experiments: {
      sonnet10: { model: 'claude-3-5-sonnet-20241022', rollout: 10, cohorts: ['beta'] }
    }
  },
  extraction: {
    default: 'gpt-4o-mini',
    experiments: {
      gemini5: { model: 'gemini-1.5-pro', rollout: 5, minContextLength: 50000 }
    }
  }
};

export function getModelForTask(
  task: string,
  user: { id: string; cohort?: string },
  ctx: { contextTokens?: number }
) {
  const cfg = (modelConfig as any)[task];
  for (const exp of Object.values(cfg.experiments || {})) {
    const e = exp as any;
    const inCohort = !e.cohorts || e.cohorts.includes(user.cohort);
    const passesCtx = !e.minContextLength || (ctx.contextTokens || 0) >= e.minContextLength;
    // Per-request coin flip; hash user.id instead for sticky assignment.
    const inRollout = Math.random() * 100 < e.rollout;
    if (inCohort && passesCtx && inRollout) return e.model;
  }
  return cfg.default;
}
Rollout cadence: 5% (beta) → 25% → 50% → 100% — with clear rollback criteria (quality, latency, error rate, CSAT).
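One caveat on percentage rollouts: a fresh Math.random() per request means the same user can bounce between models mid-session. A sticky variant hashes the user ID into a stable bucket; a minimal sketch, with an illustrative (non-cryptographic) rolling hash:

```typescript
// Deterministic bucketing: the same user always lands in the same
// 0–99 bucket for a given experiment, so assignment is stable.
export function bucket(userId: string, experimentKey: string): number {
  const s = `${userId}:${experimentKey}`;
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0; // simple 32-bit rolling hash
  }
  return h % 100;
}

// A user is in the rollout if their bucket falls below the percentage.
// Raising rollout from 10 to 25 keeps the original 10% enrolled.
export function inRollout(userId: string, experimentKey: string, rolloutPct: number): boolean {
  return bucket(userId, experimentKey) < rolloutPct;
}
```

Keying the hash on user + experiment also decorrelates experiments, so the same early adopters don't absorb every risky rollout at once.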
Pattern #5 — Prompt registry: version prompts like code
Prompts are code. Version them, AB‑test them, and tie outputs to prompt + model versions for traceability.
// lib/prompts/registry.ts
export const prompts = {
  summarization: {
    v1: {
      system: 'You are a precise summarization assistant.',
      user: (text: string) => 'Summarize in 3-5 bullets:\n\n' + text,
      deprecated: true
    },
    v2: {
      system: 'World-class summarizer. Focus on key insights and action items.',
      user: (text: string) => 'Summarize in 3-5 bullets with actions:\n\n' + text,
      validFrom: '2024-03-01',
      models: ['gpt-4-turbo', 'claude-3-5-sonnet-20241022']
    }
  }
};

export function getPrompt(name: keyof typeof prompts, version?: string) {
  const bucket = prompts[name];
  // Caution: lexicographic sort breaks past v9 ('v10' < 'v2'); use
  // zero-padded or numeric version keys in a real registry.
  const v = version || Object.keys(bucket).sort().pop()!;
  return (bucket as any)[v];
}
Pattern #6 — Budgets: make cost and latency first‑class
Every LLM call should carry a cost and latency budget. Select models and parameters to honor those constraints.
// lib/llm/budget.ts
export interface Budget { maxCostPerRequest: number; maxLatency: number; maxTokens: number; }

export async function generateWithBudget(
  llm: LLMAdapter,
  params: GenerateParams,
  budget: Budget
) {
  const model = selectModelForBudget(budget); // your logic here
  const res = await llm.generate({
    ...params,
    model,
    maxTokens: Math.min(params.maxTokens || budget.maxTokens, budget.maxTokens)
  });
  if (res.usage.totalCost > budget.maxCostPerRequest) {
    throw new Error(`Cost budget exceeded: $${res.usage.totalCost} > $${budget.maxCostPerRequest}`);
  }
  if (res.latency > budget.maxLatency) {
    throw new Error(`Latency budget exceeded: ${res.latency}ms > ${budget.maxLatency}ms`);
  }
  return res;
}
- Real‑time chat: <500ms latency, <$0.01/request
- Document analysis: <10s latency, <$0.50/document
- Batch processing: minutes ok, <$0.05/item
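Those tiers can live next to the Budget interface as named presets. The cost and latency numbers below come from the targets above; the maxTokens values and the 5-minute batch ceiling are illustrative assumptions:

```typescript
interface Budget { maxCostPerRequest: number; maxLatency: number; maxTokens: number; }

// Named tiers matching the targets above; maxTokens values are assumptions.
const budgets: Record<'chat' | 'docAnalysis' | 'batch', Budget> = {
  chat:        { maxCostPerRequest: 0.01, maxLatency: 500,     maxTokens: 512 },
  docAnalysis: { maxCostPerRequest: 0.50, maxLatency: 10_000,  maxTokens: 4096 },
  batch:       { maxCostPerRequest: 0.05, maxLatency: 300_000, maxTokens: 2048 },
};

export function budgetFor(tier: keyof typeof budgets): Budget {
  return budgets[tier];
}
```

Call sites then declare intent (`generateWithBudget(llm, params, budgetFor('chat'))`) instead of scattering magic numbers.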
Handling 2–3× capability jumps (without rewriting everything)
- Simplify when you can: if the new model makes your RAG+finetune unnecessary, delete the complexity.
- Re‑evaluate retrieval: as context grows, chunking and ranking strategies should evolve — or disappear.
- Continuously evaluate: run your eval suite weekly; switch when quality/cost cross your threshold.
- Design one generation ahead: plan for cheaper inference, longer context, better reasoning, and multimodal by default.
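"Switch when quality/cost cross your threshold" can be made concrete as a pure gate over the aggregates the eval harness already emits. The thresholds below (1-point quality tolerance, 20% cost or latency win) are illustrative, not prescriptive:

```typescript
interface EvalSummary { model: string; passRate: number; avgLatency: number; avgCost: number; }

// Promote a challenger only if quality holds (within a small tolerance)
// and it is meaningfully cheaper or faster. Thresholds are illustrative.
export function shouldSwitch(current: EvalSummary, challenger: EvalSummary): boolean {
  const qualityOk = challenger.passRate >= current.passRate - 0.01;
  const cheaper = challenger.avgCost <= current.avgCost * 0.8;
  const faster = challenger.avgLatency <= current.avgLatency * 0.8;
  return qualityOk && (cheaper || faster);
}
```

Against the comparison table earlier, this gate would promote gpt-4o over gpt-4-turbo (quality up, half the cost) but reject gemini-1.5-pro despite its lower cost, because its pass rate drops too far.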
LLM‑agnostic checklist
- Swap models with config only
- Adapter layer abstracts provider APIs
- Routing by task, latency, cost, and context size
- Automated evals with weekly runs
- Feature flags for gradual rollouts and rollbacks
- Prompt versioning and traceability
- Explicit cost/latency budgets
- Observability on quality, cost, latency
The best LLM for your product will change every quarter. Your architecture shouldn’t. Build abstractions, invest in evals and routing, and make tradeoffs explicit.
