Blog
In-depth technical discussions, system architecture breakdowns, and insights from our research community.
-
Building an LLM Inference Engine from Scratch (Part 2)
Part 2 of building an LLM inference engine. Learn how continuous batching and chunked prefill maximize throughput by scheduling at the iteration level and preventing long prompts from blocking decode, with GPU and CPU benchmark experiments.
-
Building an LLM Inference Engine from Scratch (Part 1)
An educational deep-dive into building an LLM inference engine from scratch. Learn how PagedAttention solves memory fragmentation through block-based KV cache management, with detailed C++ code examples.
-
Beyond the Token Wall: How REFORM Redefines Long-Context AI Inference
Explore REFORM, a method from Woomin Song and colleagues at KAIST that combines recurrent compression with random access for efficient long-context AI inference.
-
Scaling AI Inference: How NVIDIA Dynamo Delivers High-Performance Open Source Serving
William Arnold from NVIDIA explains Dynamo, a platform for high-performance, open-source inference that uses disaggregated serving to scale LLM workloads.
-
dstack: The New Default GPU Orchestration Stack
Andrey Cheptsov details dstack, an open-source, GPU-centric orchestrator designed to simplify GPU provisioning, development environments, training, and inference.
-
Optimizing Embedding Model Inference: Balancing Throughput and Latency
Insights from Philip Kiely of Baseten on optimizing embedding model inference: balancing throughput and latency with TensorRT-LLM and quantization.
-
4 Surprising Truths About Scaling Reinforcement Learning to Production
Practical strategies for system-level optimization in large-scale RL environments. Learn about the long-tail effect, partial rollouts in SGLang, CUDA-graph-aware refit, and solutions to the training-inference mismatch problem.
-
Why Your AI Gives Different Answers: The Deep-Seated Bug You've Never Heard Of
Exploring how floating-point non-associativity undermines determinism and reproducibility in deep learning. Learn why LLMs produce non-deterministic outputs even at temperature 0, and how GPU hardware design trades off accuracy, speed, and reproducibility.
-
Edge AI and Hardware Co-Design
A comprehensive exploration of Edge AI deployment strategies, covering immutable operating systems, GPU integration with Kubernetes, hardware co-design, and the challenges of deploying AI at the edge.
-
Understanding High Throughput LLM Inference Systems
An architectural deep dive into vLLM, exploring PagedAttention, optimized KV caching, chunked prefill, and advanced features that enable efficient LLM serving at scale.
Want to contribute? Reach out at daniel@aerlabs.tech or shubham@aerlabs.tech