// archived 2026-04-29
deepseek-ai

DeepGEMM

#AI #CUDA #LLM #GPU #Deep Learning #Optimization
7,104

// summary

DeepGEMM is a unified CUDA library providing high-performance tensor core kernels specifically optimized for modern large language models. It features a lightweight Just-In-Time compilation module that eliminates the need for CUDA compilation during installation. The library delivers expert-tuned performance for various matrix operations, including FP8, FP4, and BF16 GEMMs, as well as fused MoE and MQA scoring.

// technical analysis

DeepGEMM is a unified CUDA kernel library that provides core computation primitives for modern large language models, including FP8, FP4, and BF16 GEMMs, fused MoE, and MQA scoring. A lightweight Just-In-Time (JIT) module compiles kernels at runtime, so installation requires no CUDA compilation, while performance rivals expert-tuned libraries. The project favors simplicity and accessibility: it avoids heavy template metaprogramming, which leaves a clean codebase for developers who want to study and implement advanced NVIDIA GPU kernel optimizations.
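The compile-on-first-use idea behind a JIT module can be illustrated with a small cache. This is a minimal sketch of the general pattern only: `JitCache`, `KernelKey`, and `fake_compile` are hypothetical names, and the "compiler" here just builds a Python closure rather than invoking a CUDA toolchain. It is not DeepGEMM's actual JIT implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical key: a kernel is specialized per data type and problem shape.
KernelKey = Tuple[str, int, int, int]  # (dtype, m, n, k)
Matrix = List[List[float]]
Kernel = Callable[[Matrix, Matrix], Matrix]


class JitCache:
    """Compiles a kernel the first time a key is requested, then reuses it."""

    def __init__(self, compile_fn: Callable[[KernelKey], Kernel]) -> None:
        self._compile_fn = compile_fn
        self._kernels: Dict[KernelKey, Kernel] = {}
        self.compile_count = 0  # how many real compilations happened

    def get(self, key: KernelKey) -> Kernel:
        kernel = self._kernels.get(key)
        if kernel is None:
            # Cache miss: pay the compilation cost once for this key.
            kernel = self._compile_fn(key)
            self._kernels[key] = kernel
            self.compile_count += 1
        return kernel


def fake_compile(key: KernelKey) -> Kernel:
    """Stand-in for a real compiler: returns a shape-specialized matmul."""
    _dtype, m, n, k = key

    def kernel(a: Matrix, b: Matrix) -> Matrix:
        return [
            [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)
        ]

    return kernel
```

Requesting the same key twice returns the same kernel object, so the compilation cost is paid at first use instead of at install time.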

// key highlights

01
Provides high-performance GEMM kernels supporting multiple data formats including FP8, FP4, and BF16.
02
Implements Mega MoE, a fused kernel that overlaps NVLink communication with tensor core computation to maximize throughput.
03
Features a lightweight JIT module that compiles kernels at runtime, removing the burden of manual CUDA compilation.
04
Includes specialized MQA scoring kernels designed for the lightning indexer, supporting both non-paged and paged memory layouts.
05
Targets NVIDIA SM90 (Hopper) and SM100 (Blackwell) architectures, with memory operations aligned for the Tensor Memory Accelerator (TMA).
06
Offers flexible grouped GEMM APIs for contiguous and masked layouts, specifically tailored for efficient MoE expert processing.
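To make the two grouped GEMM layouts concrete, here is a plain-Python reference of what each one computes. It is a sketch under stated assumptions, not the library's API: the function names, the argument order, and the choice to zero-fill unused rows in the masked case are all hypothetical, and the real kernels run on the GPU in low-precision formats.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain A (m x k) times B (k x n)."""
    n = len(b[0])
    return [
        [sum(x * b[p][j] for p, x in enumerate(row)) for j in range(n)]
        for row in a
    ]


def grouped_gemm_contiguous(
    a: Matrix, b_per_expert: List[Matrix], expert_of_row: List[int]
) -> Matrix:
    """Contiguous layout (assumed): rows for all experts are concatenated
    along M, and expert_of_row[i] says which expert's weights row i uses."""
    return [
        matmul([row], b_per_expert[expert])[0]
        for row, expert in zip(a, expert_of_row)
    ]


def grouped_gemm_masked(
    a_per_expert: List[Matrix], b_per_expert: List[Matrix], valid_rows: List[int]
) -> List[Matrix]:
    """Masked layout (assumed): each expert owns a fixed-size buffer, and only
    the first valid_rows[e] rows hold real tokens. Unused rows are zero-filled
    here purely to keep the output deterministic."""
    out = []
    for a, b, m in zip(a_per_expert, b_per_expert, valid_rows):
        n = len(b[0])
        computed = matmul(a[:m], b)
        padding = [[0.0] * n for _ in range(len(a) - m)]
        out.append(computed + padding)
    return out
```

The contiguous form suits cases where token counts per expert are known on the host, while the masked form fits fixed-size buffers whose fill level is only known at run time.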

// use cases

01
High-performance FP8, FP4, and BF16 GEMM operations for LLMs
02
Mega MoE kernels with fused communication and tensor core computation
03
MQA scoring kernels for lightning indexers in large-scale inference
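As a rough picture of what an MQA scoring step computes, the sketch below scores every context position against each query token. It assumes the score is a per-head weighted sum of ReLU(q · k) with a single key vector shared by all query heads; that formulation, the function name `mqa_scores`, and the argument layout are assumptions for illustration and are not stated on this page.

```python
from typing import List

Vector = List[float]


def mqa_scores(
    q: List[List[Vector]], k: List[Vector], weights: List[List[float]]
) -> List[List[float]]:
    """Reference scoring (assumed form).

    q[t][h]       query vector for token t and head h
    k[s]          the one shared key vector for context position s
    weights[t][h] per-head weight for token t
    Returns scores[t][s] = sum over h of weights[t][h] * relu(q[t][h] . k[s]).
    """
    scores = []
    for q_t, w_t in zip(q, weights):
        row = []
        for k_s in k:
            total = 0.0
            for q_th, w_th in zip(q_t, w_t):
                dot = sum(x * y for x, y in zip(q_th, k_s))
                total += w_th * max(dot, 0.0)  # ReLU before weighting
            row.append(total)
        scores.append(row)
    return scores
```

Because all heads read the same key per position, the key cache is small, which is the usual reason for choosing a multi-query layout in large-scale inference.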

// getting started

To begin, clone the repository recursively using 'git clone --recursive' to ensure all submodules are included. Run the provided 'develop.sh' script to link essential includes and build the C++ JIT module, followed by 'install.sh' to finalize the installation. Once installed, you can import the 'deep_gemm' module directly into your Python environment to access the optimized kernels.
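After running the install steps, a quick check that the module is visible to the current interpreter can save debugging time. The helper below is a small sketch using only the standard library; `module_available` is a hypothetical name, and the check only confirms that `deep_gemm` can be located, not that a compatible GPU or CUDA toolkit is present.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if module_available("deep_gemm"):
        print("deep_gemm found; it can be imported in this environment.")
    else:
        print("deep_gemm not found; re-run the install steps in this environment.")
```

Running the check inside the same virtual environment used for installation avoids false negatives caused by a different interpreter.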