Inference

Tencent / ncnn

ncnn is a high-performance neural network forward computation framework specifically optimized for mobile platforms, designed to simplify the deployment of deep learning algorithms on mobile devices. The framework has no third-party dependencies and features cross-platform capabilities, with execution speeds on mobile CPUs that outperform all currently known open-source frameworks. Currently, ncnn is widely used in various mainstream applications under Tencent, helping developers easily build intelligent applications.

89

Tencent / ncnn

ncnn is a high-performance neural network forward computation framework deeply optimized for mobile platforms. The framework has no third-party dependencies and features cross-platform capabilities, outperforming all known open-source frameworks on mobile CPUs. Developers can easily port deep learning models to mobile devices using ncnn to build various intelligent applications.

87

alibaba / MNN

MNN is a high-performance, lightweight deep learning framework designed for efficient model inference and training on mobile and embedded devices. It supports a wide range of neural network architectures and provides versatile tools for model conversion, compression, and general-purpose computation. The framework is widely used in production environments, including various Alibaba applications, to enable device-cloud collaborative machine learning.

81

PaddlePaddle / FastDeploy

FastDeploy is an inference deployment toolkit for large language models and vision-language models based on PaddlePaddle, designed to provide out-of-the-box production-grade deployment solutions. This tool supports various mainstream hardware platforms and integrates load-balanced PD separation, unified KV cache transmission, and multiple advanced acceleration technologies. Developers can achieve rapid deployment through OpenAI API-compatible interfaces and optimize inference performance using full quantization format support.

71

alibaba / rtp-llm

RTP-LLM is a high-performance LLM inference acceleration engine developed by the Alibaba Foundation Model Inference team. This engine has been widely applied in various Alibaba business scenarios such as Taobao and Tmall, supporting multiple mainstream model formats and hardware backends. It provides efficient production-level services for large language models by integrating advanced operator optimization, quantization techniques, and distributed inference capabilities.

70

PaddlePaddle / FastDeploy

FastDeploy is an inference deployment toolkit for large language models and vision-language models based on PaddlePaddle, aiming to provide out-of-the-box production-grade deployment solutions. The toolkit supports various mainstream hardware platforms and integrates core technologies such as load-balanced PD separation, unified KV cache transmission, and full quantization format support. By being compatible with OpenAI API and vLLM interfaces, it helps developers efficiently implement model inference and online service deployment.

68

alibaba / rtp-llm

RTP-LLM is a high-performance large model inference acceleration engine developed by the Alibaba Foundation Model Inference Team, widely used in various business scenarios such as Taobao and Tmall. By integrating various advanced CUDA kernels and quantization techniques, the engine significantly improves model inference performance and efficiency. Furthermore, it possesses high flexibility, supporting multiple model formats, multimodal inputs, and LoRA service deployment.

68

toverainc / willow

The Willow Inference Server allows users to self-host high-speed language inference tasks for various applications. It supports essential features including speech-to-text, text-to-speech, and large language model processing. Users can access official documentation and community support through the project's website and GitHub discussions.

67

alibaba / tair-kvcache

Tair KVCache is an Alibaba Cloud system designed to accelerate Large Language Model inference through distributed memory pooling and dynamic multi-level caching. The project provides a centralized manager for global KVCache metadata and storage capacity, ensuring efficient data reliability and resource utilization. Additionally, it includes a high-fidelity simulation tool that allows developers to predict performance metrics without requiring actual GPU resources.

62

google-ai-edge / LiteRT-LM

LiteRT-LM is a high-performance, production-ready inference framework designed by Google for deploying Large Language Models on edge devices. It supports a wide range of platforms including Android, iOS, desktop, and IoT, while leveraging GPU and NPU hardware acceleration for optimal performance. The framework enables advanced capabilities such as multi-modality and function calling, powering on-device AI experiences in various Google products.

54

alexzhang13 / rlm

Recursive Language Models (RLMs) provide a task-agnostic inference paradigm that enables language models to handle near-infinite contexts through programmatic decomposition and recursive self-calling. The framework replaces standard completion calls with an RLM-specific interface that offloads context into a REPL environment for interactive execution. This repository offers an extensible engine supporting various local and cloud-based sandbox environments to facilitate complex, multi-step language model reasoning.

49

mnfst / awesome-free-llm-apis

This repository provides a curated list of LLM API providers that offer permanent free tiers for text inference. It categorizes services into direct provider APIs and third-party inference platforms, detailing model capabilities, context windows, and rate limits. The collection serves as a comprehensive resource for developers seeking cost-effective access to various large language models.

43

baidu / vLLM-Kunlun

vLLM Kunlun is a community-maintained hardware plugin that enables the seamless execution of vLLM on Kunlun XPU devices. It functions as a hardware-pluggable interface, allowing users to run various large language and multimodal models without modifying the original vLLM source code. The project supports advanced features like quantization, LoRA fine-tuning, and hardware-accelerated graph optimization to ensure high-performance inference.

40

Michael-A-Kuykendall / shimmy

Shimmy is a lightweight, single-binary server that provides a 100% OpenAI-compatible API for running GGUF models locally. It features zero-configuration model discovery, automatic GPU backend detection, and advanced CPU/GPU hybrid processing for large models. Designed for privacy and performance, it allows developers to integrate local LLMs into existing tools without code changes.

37

// new this month

// ecosystem

// recent newcomers

// this week's top 6

// all-time featured (14)

// use cases by project

// comparisons

// related topics