LLM Workload Performance Optimization
Metrics Matrix
| Axis | Example | Measurement Method | Typical Tradeoff |
|---|---|---|---|
| Latency | p95 response time | Wall-clock timing per request | Deeper reasoning increases response time |
| Cost | $/request | Aggregate billed tokens per request | Larger models cost more but tend to give higher quality |
| Quality | Accuracy / valid structured-output rate | Automated scoring on an evaluation set | Degrades when speed is prioritized |
| Safety | Harmful generation rate | Filter logs | Guardrails add latency |
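
The latency and cost rows above can be measured from per-request logs. The following is a minimal sketch, not a reference implementation: the `RequestRecord` type, the two price constants, and both function names are hypothetical, and the prices are placeholders rather than any provider's real rates. It uses the nearest-rank definition of the 95th percentile.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestRecord:
    """One logged request (hypothetical record shape)."""
    latency_s: float     # wall-clock time from send to full response
    input_tokens: int    # billed prompt tokens
    output_tokens: int   # billed completion tokens


# Placeholder prices in USD per 1,000 tokens (assumed values).
PRICE_PER_1K_INPUT = 0.001
PRICE_PER_1K_OUTPUT = 0.002


def p95_latency(records: List[RequestRecord]) -> float:
    """Nearest-rank 95th percentile of request latency."""
    if not records:
        raise ValueError("no records")
    latencies = sorted(r.latency_s for r in records)
    # Integer ceiling of 0.95 * n, giving a 1-based rank.
    rank = (95 * len(latencies) + 99) // 100
    return latencies[rank - 1]


def cost_per_request(records: List[RequestRecord]) -> float:
    """Mean billed cost per request in USD."""
    if not records:
        raise ValueError("no records")
    total = sum(
        r.input_tokens / 1000 * PRICE_PER_1K_INPUT
        + r.output_tokens / 1000 * PRICE_PER_1K_OUTPUT
        for r in records
    )
    return total / len(records)
```

Reporting both numbers from the same set of records makes the tradeoff column easier to read: a change that lowers `cost_per_request` can be checked against its effect on `p95_latency` for the same traffic.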
Optimization Levers
| Lever | Concrete Strategy | Caution |
|---|---|---|
| Input reduction | Summarize context; retrieve only relevant passages via embedding search | Summarization can lose information |
| Output reduction | Constrain output with a JSON schema | Reduced flexibility |
| Parallelization | Run independent subtasks concurrently | Rate limits |
| Caching | Vector (embedding) cache / response cache | Storage cost |
| Model selection | Route between models (light → heavy) | Routing errors |
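
Two of the levers above, response caching and light → heavy routing, can be combined in a few lines. This is a sketch under stated assumptions: `CachedRouter`, `is_valid_json`, and the `light` / `heavy` callables are hypothetical names, and reading "light → heavy" as "try the light model first and escalate when its answer fails a check" is one possible interpretation. The cache is an exact-match, in-memory dictionary; a vector cache would match on embedding similarity instead.

```python
import hashlib
import json
from typing import Callable, Dict


def is_valid_json(text: str) -> bool:
    """Acceptance check (assumed): the answer must parse as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


class CachedRouter:
    """Exact-match response cache in front of a light -> heavy fallback."""

    def __init__(
        self,
        light: Callable[[str], str],   # hypothetical cheap model call
        heavy: Callable[[str], str],   # hypothetical expensive model call
        accept: Callable[[str], bool] = is_valid_json,
    ) -> None:
        self.light = light
        self.heavy = heavy
        self.accept = accept
        self.cache: Dict[str, str] = {}
        self.heavy_calls = 0           # counts escalations, for monitoring

    @staticmethod
    def _key(prompt: str) -> str:
        # Hash the prompt so cache keys have a fixed size.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self.cache:
            return self.cache[key]     # cache hit: no model call at all
        answer = self.light(prompt)
        if not self.accept(answer):
            # Escalate only when the light model's answer is rejected.
            self.heavy_calls += 1
            answer = self.heavy(prompt)
        self.cache[key] = answer
        return answer
```

The cautions in the table show up directly here: the cache grows with every distinct prompt (storage cost), and a weak `accept` check lets poor light-model answers through, which is one form of routing error. Tracking `heavy_calls` against total requests gives a rough view of how often escalation happens.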