Technical Blog

In-depth technical discussions, system architecture breakdowns, and insights from our research community.

Articles

December 2025 · AI Research

Beyond the Token Wall: How REFORM Redefines Long-Context AI Inference

Explore REFORM, a method from Woomin Song and colleagues at KAIST that combines recurrent compression with random access for efficient long-context inference.

Woomin Song, PhD Scholar at KAIST
Long Context · Inference · Research
December 2025 · AI Inference

Scaling AI Inference: How NVIDIA Dynamo Delivers High-Performance Open Source Serving

William Arnold from NVIDIA explains Dynamo, an open-source platform for high-performance inference that uses disaggregated serving to scale LLM workloads.

William Arnold, NVIDIA Dynamo Team
NVIDIA Dynamo · Inference · Scaling
December 2025 · ML Infrastructure

dstack: The new default GPU orchestration stack

Andrey Cheptsov details dstack, an open-source GPU-centric orchestrator designed to simplify GPU provisioning, development, training, and inference.

Andrey Cheptsov, Core Maintainer of dstack
dstack · Orchestration · GPU · MLOps
December 2025 · AI Inference

Optimizing Embedding Model Inference: Balancing Throughput and Latency

Insights from Philip Kiely of Baseten on optimizing embedding model inference, balancing throughput and latency, and using TensorRT-LLM and quantization.

Philip Kiely, Early Employee at Baseten
Embeddings · Inference · Optimization · Baseten
October 2025 · Reinforcement Learning

4 Surprising Truths About Scaling Reinforcement Learning to Production

Practical strategies for system-level optimization in large-scale RL environments. Learn about the long-tail effect, partial rollout in SGLang, CUDA Graph Aware Refit, and solutions for the Training-Inference Mismatch problem.

Chenyang Zhao, AI Researcher, ByteDance, SGLang RL Lead
RL · SGLang · Systems Optimization
October 2025 · Deep Learning Systems

Why Your AI Gives Different Answers: The Deep-Seated Bug You've Never Heard Of

Exploring how floating-point non-associativity affects determinism and reproducibility in deep learning. Learn why LLMs can produce non-deterministic outputs even at temperature 0, and how GPU hardware design influences accuracy, speed, and reproducibility (a short illustration follows this entry).

Brian Chau, AI Researcher, Founder, IOI Medalist
Floating-Point · Reproducibility · Hardware · GPU
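
As a quick taste of the article's subject, here is a minimal sketch of ours (not code from the post): floating-point addition is not associative, so the order in which a parallel GPU reduction accumulates values can change the result.

    # Floating-point addition is not associative: grouping changes the result.
    a, b, c = 1e20, -1e20, 1.0

    print((a + b) + c)   # 1.0 -- the large values cancel first, then c is added
    print(a + (b + c))   # 0.0 -- c is absorbed by b before the cancellation

    # The same numbers summed in different orders (as parallel reductions
    # may traverse them) therefore give different totals:
    print(sum([1e20, -1e20, 1.0]))  # 1.0
    print(sum([1e20, 1.0, -1e20]))  # 0.0

In a deep learning stack the accumulation order depends on kernel choice and thread scheduling, which is one way bit-identical inputs can still yield different outputs.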
September 2025 · Edge AI

Edge AI and Hardware Co-Design

A comprehensive exploration of Edge AI deployment strategies, covering immutable operating systems, GPU integration with Kubernetes, hardware co-design, and the challenges of deploying AI at the edge.

Marco Gonzalez, Sr. Software Engineer, Red Hat
Edge AI · Hardware Co-Design · Infrastructure · Deployment
September 2025 · vLLM Inference

Understanding High Throughput LLM Inference Systems

An architectural deep dive into vLLM, exploring PagedAttention, optimized KV caching, chunked prefill, and the advanced features that enable efficient LLM serving at scale (a toy sketch of the block-table idea follows this entry).

Ayush Satyam, Software Engineer, Red Hat
vLLM · Inference Systems · Architecture
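
To make the teaser's key term concrete, here is a toy sketch in the spirit of PagedAttention (illustrative only; the names and pool layout are ours, not vLLM's): the KV cache lives in fixed-size physical blocks, and each sequence maps its logical token positions onto blocks allocated on demand rather than reserving contiguous memory for its maximum length.

    BLOCK_SIZE = 4                  # tokens per physical KV block (toy value)
    free_blocks = list(range(8))    # shared pool of physical block ids
    block_tables = {}               # seq_id -> list of physical block ids

    def append_token(seq_id, position):
        # A new physical block is needed only when a sequence crosses a
        # block boundary; otherwise the current block still has room.
        if position % BLOCK_SIZE == 0:
            block_tables.setdefault(seq_id, []).append(free_blocks.pop())

    for pos in range(6):            # "seq-a" generates 6 tokens -> 2 blocks
        append_token("seq-a", pos)
    for pos in range(3):            # "seq-b" generates 3 tokens -> 1 block
        append_token("seq-b", pos)

    print(block_tables)             # {'seq-a': [7, 6], 'seq-b': [5]}

Because blocks are allocated lazily from a shared pool and returned when a sequence finishes, over-reservation and fragmentation largely disappear, which is what allows the large batch sizes the article describes.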

Want to contribute to our blog? Reach us at daniel@aerlabs.tech or shubham@aerlabs.tech.