Decoding How AI Works

Our vision is not just to optimize AI systems, but to remove the critical bottlenecks that prevent their widespread adoption.

30+ Researchers
Global Community
Open Source First

What We Do

Our work is fundamentally about enabling efficient AI inference at scale. We are tackling the critical computational and memory bottlenecks that currently limit the widespread deployment of large language models.

Our approach is built on two core pillars:

01

Full-Stack Co-Design

We don't just look at the model; we analyze the entire system stack. By diving deep into the mechanics of operations like attention, we're building frameworks to co-optimize the algorithm and the underlying hardware pipeline. The goal is to move beyond the 'black box' and engineer for peak performance and minimal resource consumption.
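As a rough illustration of the kind of question this co-design work starts from, the sketch below estimates whether a single self-attention layer is compute-bound or memory-bound on a hypothetical accelerator. The FLOP-rate and bandwidth figures are assumed values for illustration, not measurements from our framework.

```python
# Rough, illustrative analysis only: estimate whether one self-attention
# layer is compute- or memory-bound on a hypothetical accelerator.
# peak_tflops and mem_bw_gbs are assumed hardware numbers, not measurements.

def attention_roofline(seq_len, d_model, n_heads, bytes_per_elem=2,
                       peak_tflops=300.0, mem_bw_gbs=2000.0):
    """Estimate FLOPs vs. bytes moved for one self-attention layer."""
    d_head = d_model // n_heads
    # QK^T and scores@V matmuls dominate: 2 * (2 * L^2 * d_head) per head
    flops = 4 * n_heads * (seq_len ** 2) * d_head
    # K/V reads plus the score matrix written and re-read (very rough)
    bytes_moved = bytes_per_elem * (
        2 * seq_len * d_model              # K and V
        + 2 * n_heads * seq_len ** 2       # attention scores out/in
    )
    intensity = flops / bytes_moved                      # FLOPs per byte
    ridge = (peak_tflops * 1e12) / (mem_bw_gbs * 1e9)    # hardware balance point
    bound = "compute-bound" if intensity > ridge else "memory-bound"
    return intensity, ridge, bound

print(attention_roofline(seq_len=4096, d_model=4096, n_heads=32))
```

Comparing arithmetic intensity against the hardware balance point is what tells us whether an operation needs algorithmic changes (less data movement) or more compute, which is exactly the algorithm/hardware trade-off this pillar targets.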

02

Democratizing Deployment

Our ultimate vision is to make powerful AI accessible beyond large-scale cloud data centers. We are engineering solutions that allow these complex models to run efficiently across diverse hardware platforms, whether on-premises, in the cloud, or, critically, at the edge. This is about reducing the cost and latency of inference to make AI a practical tool for every enterprise.

Current Projects

AER-Q: Hardware-Aware Foundation Model

Active

A 20B-parameter foundation model co-designed for ultra-efficient inference. Using gradient-based sensitivity analysis, we integrate quantization awareness directly into pre-training, achieving a 2-3x reduction in latency and memory footprint while maintaining SOTA performance.

Quantization · Model Training · Co-Design
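For readers unfamiliar with the technique behind AER-Q, here is a minimal sketch of quantization-aware training in PyTorch, assuming symmetric per-tensor int8 fake-quantization with a straight-through estimator. The gradient-based sensitivity analysis and the actual pre-training recipe are not shown; this only illustrates the general idea of exposing the model to quantization noise during training.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Symmetric per-tensor int8 fake-quantization (illustrative only)."""

    @staticmethod
    def forward(ctx, x):
        qmax = 127  # 2**(8-1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        # Quantize then dequantize so downstream ops see int8-resolution values
        return torch.round(x / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients through unchanged
        return grad_output

class QuantLinear(torch.nn.Linear):
    def forward(self, x):
        w_q = FakeQuant.apply(self.weight)  # weights see quantization noise during training
        return torch.nn.functional.linear(x, w_q, self.bias)

# Usage: swap nn.Linear for QuantLinear during (pre-)training so the model
# learns to tolerate low-precision weights before deployment.
layer = QuantLinear(1024, 1024)
out = layer(torch.randn(4, 1024))
```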

AI Hardware Router

Active

A self-contained edge computing architecture for on-device AI with complete data sovereignty. Features hierarchical inference with an on-device SLM for low-latency queries and cloud offloading with PII masking for complex tasks, pioneering privacy-first AI deployment.

Edge AI · Privacy · Systems
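A hedged sketch of the hierarchical routing idea follows. The PII patterns, the word-count routing heuristic, and the slm_generate/cloud_generate callables are placeholders for illustration, not the actual router architecture.

```python
import re

# Placeholder PII patterns; a real deployment would use a proper detector.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[ID]"),            # SSN-style numbers
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),   # email addresses
]

def mask_pii(text: str) -> str:
    """Replace obvious PII spans before any query leaves the device."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

def route(query: str, slm_generate, cloud_generate, max_local_words: int = 64) -> str:
    """Serve short queries with the on-device SLM; offload the rest with PII masked."""
    if len(query.split()) <= max_local_words:
        return slm_generate(query)           # low-latency on-device path
    return cloud_generate(mask_pii(query))   # complex task, privacy-preserving offload

# Usage with stub backends:
answer = route("What's on my calendar today?",
               slm_generate=lambda q: f"[SLM] {q}",
               cloud_generate=lambda q: f"[cloud] {q}")
```

The key design point is that masking sits in front of the offload path, so raw user data never leaves the device even when a query is too complex for the local model.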

Multi-Model Agentic Platform

Active

High-throughput serving infrastructure for heterogeneous LLM/VLM mixtures. Hardware-aware architecture with tensor parallelism and speculative decoding, targeting sub-2s time-to-first-token (TTFT) and 1000+ tokens/s throughput for complex multi-hop agentic workflows.

Serving · Agents · Infrastructure
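As an illustration of one technique named above, the sketch below shows greedy speculative decoding with toy draft/target callables. The real serving stack (tensor parallelism, batching, schedulers) is out of scope here, and the verification loop stands in for what a production server does in a single batched forward pass.

```python
def speculative_decode(prefix, draft_next, target_next, draft_len=4, max_new=32):
    """Greedy speculative decoding: draft cheaply, keep what the target agrees with."""
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new:
        # 1. The small draft model proposes a short continuation
        proposal, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target model verifies the proposal position by position
        #    (a real server does this in one batched forward pass)
        accepted = 0
        for i, t in enumerate(proposal):
            if target_next(tokens + proposal[:i]) == t:
                accepted += 1
            else:
                break
        tokens.extend(proposal[:accepted])
        # 3. On a mismatch, fall back to one token from the target model
        if accepted < draft_len:
            tokens.append(target_next(tokens))
    return tokens

# Usage with toy "models" over integer tokens:
out = speculative_decode([1, 2, 3],
                         draft_next=lambda ctx: (ctx[-1] + 1) % 10,
                         target_next=lambda ctx: (ctx[-1] + 1) % 10)
```

When the draft model agrees with the target often enough, each target verification step yields several tokens instead of one, which is where the throughput gain comes from.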

Our Labs

AI Inference Lab - ModuLabs South Korea (30+ Members)

Bringing together a world-class team of researchers and engineers from Samsung, Google, Trillion Labs, ETRI, Seoul National University, KAIST, and other leading technology firms and universities, AI Inference Lab combines deep expertise in large language models, system optimization, and production infrastructure.

We decode how large language models think, making them smarter. We're not chasing AGI; we're enabling AI in daily life. We focus on solving the real-world challenges of deploying AI at scale, making this technology achievable for every enterprise, whether on cloud, on-premises, or at the edge.

Want to join? Send your profile to daniel@aerlabs.tech

Learn More →

AI Inference Lab - India

We're building a community of researchers passionate about open source AI research in India.

If you're interested in working on cutting-edge open source AI research and contributing to the future of AI inference optimization, we'd love to hear from you.

Want to join? Send your profile to shubham@aerlabs.tech

Knowledge Sharing

We host regular technical discussions and deep-dive sessions on AI inference, system optimization, and LLM deployment. Our community brings together researchers and engineers to share knowledge and advance the field together.

These sessions are documented as in-depth technical articles on our blog. Explore our blog →

Interested in our research? Connect with us at daniel@aerlabs.tech or shubham@aerlabs.tech