AI Engineering
Translation status
This English page serves as a localized entry point and navigation shell. The full text of the articles is currently available in Chinese.
This topic focuses on the engineering realities of production AI systems, including inference cost, KV quantization, PagedAttention, VRAM planning, compute organization, and deployment optimization.
Featured reading
- Estimating LLM Inference Cost with Precision
- VRAM Requirements for Training and Fine-Tuning Large Models
- KV Quantization: The Cost-Saving Trick in LLM Inference
Article list
- How to Think About Output Length in Large Models
- The Complexity of Software Stacks for Domain-Specific Accelerators
- How Many Tokens Does One Parameter Need for Training?
- VRAM Requirements for Training and Fine-Tuning Large Models
- Hybrid Heterogeneous Compute Clusters in the LLM Era
- How Many Chinese Characters Fit in One Token?
- PagedAttention in Practice
- Understanding vLLM's PagedAttention
- AI Chips Explained: GPU, TPU, and Compute-in-Memory
- How Large Models Actually Use Tools
- KV Quantization: The Cost-Saving Trick in LLM Inference
- Estimating LLM Inference Cost with Precision