The following technical blog details how I built, and how you too can build, a Llama 3 inference engine from scratch in CUDA C. My goal for this project was to learn CUDA programming and the underlying hardware architecture, and to leverage the strengths of accelerated computing to run an LLM.
I always try to build projects that draw on great inspiration, and this project is no different. LLMs have been improving at a rapid pace, with a strong push from the open-source community. llama.cpp and ollama are examples of great tools under active development that let anyone spin up and run almost any state-of-the-art LLM locally on most CPUs and GPUs.
Features of Llama3.cu
Llama3.cu does not include all the bells and whistles of other SOTA inference engines, but it should provide a solid foundation in CUDA programming, the GPU hardware architecture, and how common ML algorithms are implemented in the parallel programming paradigm. Features:
- Uses CUDA C for all functionalities
- Implements the Llama 3 tokenizer
- Implements custom CUDA kernels for:
  - Token-to-embedding lookup
  - Root mean square layer normalization (RMSNorm), sketched after this list
  - Tiled general matrix multiply (tiled GEMM)
  - Grouped multi-query attention (GQA)
  - A fused kernel for the feedforward network
- Executes all kernels on CUDA cores
- Stores all parameters in FP16 (2 bytes per parameter)
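
To make the parallel-programming angle concrete, below is a minimal sketch of an RMSNorm kernel operating on FP16 data. It is not the exact kernel from llama3.cu; the kernel name, the fixed 256-thread block size, and the choice to accumulate the sum of squares in FP32 are assumptions made for illustration. The overall shape, though, is the standard mapping: one thread block per token row, a shared-memory reduction for the sum of squares, then a second pass that normalizes and scales each element by the learned weight.

```cuda
#include <cuda_fp16.h>

// Minimal RMSNorm sketch: one thread block per token row, FP16 storage,
// FP32 accumulation for the reduction. Names and block size (256) are
// illustrative, not the exact ones used in llama3.cu.
__global__ void rmsnorm_kernel(const half *x, const half *weight, half *out,
                               int hidden_dim, float eps) {
    int row = blockIdx.x;                              // one block handles one token
    const half *row_in  = x   + (size_t)row * hidden_dim;
    half       *row_out = out + (size_t)row * hidden_dim;

    // Each thread accumulates a partial sum of squares in FP32.
    float partial = 0.0f;
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
        float v = __half2float(row_in[i]);
        partial += v * v;
    }

    // Block-wide tree reduction in shared memory (assumes blockDim.x == 256).
    __shared__ float sdata[256];
    sdata[threadIdx.x] = partial;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    // inv_rms = 1 / sqrt(mean(x^2) + eps); every thread reads the reduced sum.
    float inv_rms = rsqrtf(sdata[0] / hidden_dim + eps);

    // Normalize and scale by the learned weight, writing back in FP16.
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
        float v = __half2float(row_in[i]) * inv_rms;
        row_out[i] = __float2half(v * __half2float(weight[i]));
    }
}

// Example launch, assuming 256 threads per block and a 4096-wide hidden dim:
// rmsnorm_kernel<<<num_tokens, 256>>>(d_x, d_weight, d_out, 4096, 1e-5f);
```

Accumulating in FP32 while keeping parameters and activations in FP16 is a common compromise: per-parameter storage stays at 2 bytes, but the reduction over a several-thousand-wide hidden dimension does not lose precision in half-precision arithmetic.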
Before starting, note that ML optimizations such as KV caching with quantization, batching of multiple requests (in the case of client-server interaction), speculative decoding, and paged attention can significantly speed up LLM inference, but these were out of scope for this project.