deepseek-ai

DeepGEMM

AI, CUDA, LLM, GPU, Deep Learning, Optimization

7,104

// summary

DeepGEMM is a unified CUDA library providing high-performance tensor core kernels specifically optimized for modern large language models. It features a lightweight Just-In-Time compilation module that eliminates the need for CUDA compilation during installation. The library delivers expert-tuned performance for various matrix operations, including FP8, FP4, and BF16 GEMMs, as well as fused MoE and MQA scoring.

// technical analysis

DeepGEMM is a high-performance, unified CUDA kernel library designed to provide essential computation primitives for modern large language models, including FP8/FP4 GEMMs, fused MoE, and MQA scoring. By utilizing a lightweight Just-In-Time (JIT) compilation module, the library eliminates the need for complex CUDA compilation during installation while maintaining performance that rivals expert-tuned libraries. The project prioritizes simplicity and accessibility by avoiding heavy reliance on complex template metaprogramming, offering a clean codebase for developers to study and implement advanced NVIDIA GPU kernel optimizations.
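The runtime-compilation idea can be illustrated with a small, CPU-only sketch. It is not DeepGEMM's JIT module: `build_scaled_dot` is a hypothetical helper that generates Python source specialized for a given problem size, compiles it on first use, and caches the result so later calls with the same parameters skip compilation. The assumption here is only the general pattern (specialize at runtime, cache by configuration), not any detail of how the library does it.

```python
import functools


@functools.lru_cache(maxsize=None)
def build_scaled_dot(n: int, scale: float):
    """Generate and compile a dot product specialized for length n.

    Hypothetical helper for illustration: the loop is unrolled into the
    generated source, so n and scale are baked in as constants.
    """
    terms = " + ".join(f"a[{i}] * b[{i}]" for i in range(n)) or "0.0"
    source = f"def scaled_dot(a, b):\n    return {scale!r} * ({terms})\n"
    namespace = {}
    # Compilation happens here, at first use, not at install time.
    exec(compile(source, f"<jit n={n}>", "exec"), namespace)
    return namespace["scaled_dot"]


kernel = build_scaled_dot(3, 0.5)                  # compiled on first request
result = kernel([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # 0.5 * (4 + 10 + 18)
same_kernel = build_scaled_dot(3, 0.5)             # served from the cache
```

Keying the cache on the specialization parameters means the compilation cost is paid once per distinct configuration rather than once per call.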

// key highlights

Provides high-performance GEMM kernels supporting multiple data formats including FP8, FP4, and BF16.

Implements Mega MoE, a fused kernel that overlaps NVLink communication with tensor core computation to maximize throughput.

Features a lightweight JIT module that compiles kernels at runtime, so no CUDA compilation is required at install time.

Includes specialized MQA scoring kernels designed for the lightning indexer, supporting both non-paged and paged memory layouts.

Targets NVIDIA SM90 (Hopper) and SM100 (Blackwell) GPU architectures, with memory operations optimized for TMA alignment.

Offers flexible grouped GEMM APIs for contiguous and masked layouts, specifically tailored for efficient MoE expert processing.
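To make the two grouped layouts concrete, here is a plain-Python reference of what they might compute. This is a sketch of the semantics only, not the library's API: the function names, argument order, and the choice to zero-fill unused rows in the masked case are all assumptions made for illustration. In the contiguous layout, the rows for all experts are concatenated into one matrix; in the masked layout, each expert owns a fixed-size buffer and only the first few rows are valid.

```python
def matmul(a, b):
    """Plain-Python product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_gemm_contiguous(lhs, expert_weights, group_sizes):
    """Rows of all experts are concatenated; group_sizes[i] rows go to expert i."""
    assert sum(group_sizes) == len(lhs)
    out, start = [], 0
    for weights, size in zip(expert_weights, group_sizes):
        out.extend(matmul(lhs[start:start + size], weights))
        start += size
    return out


def grouped_gemm_masked(lhs_per_expert, expert_weights, valid_rows):
    """Each expert has a fixed-size buffer; only the first valid_rows[i] rows are computed."""
    out = []
    for rows, weights, valid in zip(lhs_per_expert, expert_weights, valid_rows):
        n = len(weights[0])
        computed = matmul(rows[:valid], weights)
        # Assumption for this sketch: rows beyond the valid count are zero-filled.
        padding = [[0.0] * n for _ in range(len(rows) - valid)]
        out.append(computed + padding)
    return out


w0 = [[1.0, 0.0], [0.0, 1.0]]  # expert 0: identity
w1 = [[2.0, 0.0], [0.0, 2.0]]  # expert 1: scale by 2

contiguous_out = grouped_gemm_contiguous(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [w0, w1], group_sizes=[1, 2]
)
masked_out = grouped_gemm_masked(
    [[[1.0, 2.0], [0.0, 0.0]], [[3.0, 4.0], [5.0, 6.0]]], [w0, w1], valid_rows=[1, 2]
)
```

The contiguous form suits cases where the per-expert row counts are known on the host; the masked form suits cases where buffer sizes are fixed in advance and only the valid counts vary.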

// use cases

High-performance FP8, FP4, and BF16 GEMM operations for LLMs

Mega MoE kernels with fused communication and tensor core computation

MQA scoring kernels for lightning indexers in large-scale inference
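As a rough illustration of MQA-style scoring, the sketch below computes one logit per (query token, key token) pair, with all query heads sharing a single key vector per token. The specific formula used here, a per-head weighted sum of ReLU-clamped dot products, is an assumption about what an indexer score might look like; the function name `mqa_logits_reference` and its argument shapes are hypothetical and do not describe the library's kernels or interface.

```python
def mqa_logits_reference(q, k, head_weights):
    """Reference scoring under the assumptions stated above.

    q:            [num_queries][num_heads][dim]
    k:            [num_keys][dim]  (one key shared by every query head)
    head_weights: [num_queries][num_heads]
    """
    logits = []
    for q_heads, weights in zip(q, head_weights):
        row = []
        for key in k:
            score = 0.0
            for head, w in zip(q_heads, weights):
                dot = sum(a * b for a, b in zip(head, key))
                score += w * max(dot, 0.0)  # ReLU on the per-head dot product
            row.append(score)
        logits.append(row)
    return logits


q = [[[1.0, 0.0], [0.0, -1.0]]]    # 1 query token, 2 heads, dim 2
k = [[2.0, 3.0], [-1.0, -4.0]]     # 2 key tokens
head_weights = [[0.5, 2.0]]

logits = mqa_logits_reference(q, k, head_weights)
```

A paged variant would compute the same quantity but look up each key through a block table instead of reading one dense array.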

// getting started

To begin, clone the repository recursively using 'git clone --recursive' to ensure all submodules are included. Run the provided 'develop.sh' script to link essential includes and build the C++ JIT module, followed by 'install.sh' to finalize the installation. Once installed, you can import the 'deep_gemm' module directly into your Python environment to access the optimized kernels.
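After running the scripts, a quick check along these lines can confirm that the module is visible to the interpreter before any kernels are called. Only the module name `deep_gemm` comes from the text above; `is_module_available` is a hypothetical helper built on the standard library's `importlib.util.find_spec`, and no `deep_gemm` functions are assumed.

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Return True if a top-level module `name` can be found on the Python path."""
    return importlib.util.find_spec(name) is not None


if is_module_available("deep_gemm"):
    import deep_gemm  # module name taken from the installation notes above
    print("deep_gemm imported from", deep_gemm.__file__)
else:
    print("deep_gemm not found; run develop.sh and install.sh first")
```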