// summary
DeepGEMM is a unified CUDA library of high-performance tensor core kernels optimized for modern large language models. Its lightweight Just-In-Time (JIT) compilation module compiles kernels at runtime, so no CUDA compilation is needed during installation. The library covers FP8, FP4, and BF16 GEMMs, fused MoE, and MQA scoring, with performance comparable to expert-tuned libraries.
// technical analysis
DeepGEMM is a high-performance, unified CUDA kernel library that provides the core computation primitives of modern large language models, including FP8/FP4 GEMMs, fused MoE, and MQA scoring. Its lightweight Just-In-Time (JIT) compilation module builds kernels at runtime, which removes the need for CUDA compilation during installation while keeping performance that rivals expert-tuned libraries. The project favors simplicity and accessibility: it avoids heavy reliance on complex template metaprogramming, so the codebase stays clean enough for developers to study and apply advanced NVIDIA GPU kernel optimizations.
// key highlights
// use cases
// getting started
Clone the repository with 'git clone --recursive' so that all submodules are included. Run the provided 'develop.sh' script to link the essential includes and build the C++ JIT module, then run 'install.sh' to complete the installation. Once installed, the 'deep_gemm' module can be imported directly in Python to access the optimized kernels.
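A quick way to confirm that the installation step worked is to check whether the module can be found before importing it. The sketch below uses only the Python standard library; the helper name `is_installed` is hypothetical, and the only DeepGEMM-specific detail assumed is the module name 'deep_gemm' given above.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("deep_gemm"):
    print("deep_gemm found; 'import deep_gemm' should now work")
else:
    print("deep_gemm not found; run develop.sh and install.sh from the cloned repository")
```

The check does not import the module itself, so it is safe to run on a machine without a GPU.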