HubLensTopicsInference
// topic

Inference

14trending in last 90 days·14all-time

// new this month

// ecosystem

LLM12Deep Learning6Mobile3Computer Vision3CUDA2Inference
AI 14

// recent newcomers

see all newcomers →

// this week's top 6

01
PaddlePaddle / FastDeploy
FastDeploy is an inference deployment toolkit for large language models and vision-language models based on PaddlePaddle, designed to provide out-of-the-box production-grade deployment solutions. This tool supports various mainstream hardware platforms and integrates load-balanced PD separation, unified KV cache transmission, and multiple advanced acceleration technologies. Developers can achieve rapid deployment through OpenAI API-compatible interfaces and optimize inference performance using full quantization format support.
713,681
02
alibaba / rtp-llm
RTP-LLM is a high-performance LLM inference acceleration engine developed by the Alibaba Foundation Model Inference team. This engine has been widely applied in various Alibaba business scenarios such as Taobao and Tmall, supporting multiple mainstream model formats and hardware backends. It provides efficient production-level services for large language models by integrating advanced operator optimization, quantization techniques, and distributed inference capabilities.
701,107
03
toverainc / willow
The Willow Inference Server allows users to self-host high-speed language inference tasks for various applications. It supports essential features including speech-to-text, text-to-speech, and large language model processing. Users can access official documentation and community support through the project's website and GitHub discussions.
673,025
04
alibaba / tair-kvcache
Tair KVCache is an Alibaba Cloud system designed to accelerate Large Language Model inference through distributed memory pooling and dynamic multi-level caching. The project provides a centralized manager for global KVCache metadata and storage capacity, ensuring efficient data reliability and resource utilization. Additionally, it includes a high-fidelity simulation tool that allows developers to predict performance metrics without requiring actual GPU resources.
62157
05
alexzhang13 / rlm
Recursive Language Models (RLMs) provide a task-agnostic inference paradigm that enables language models to handle near-infinite contexts through programmatic decomposition and recursive self-calling. The framework replaces standard completion calls with an RLM-specific interface that offloads context into a REPL environment for interactive execution. This repository offers an extensible engine supporting various local and cloud-based sandbox environments to facilitate complex, multi-step language model reasoning.
4944
06
Michael-A-Kuykendall / shimmy
Shimmy is a lightweight, single-binary server that provides a 100% OpenAI-compatible API for running GGUF models locally. It features zero-configuration model discovery, automatic GPU backend detection, and advanced CPU/GPU hybrid processing for large models. Designed for privacy and performance, it allows developers to integrate local LLMs into existing tools without code changes.
3782

// all-time featured (14)

Tencent / ncnn
ncnn is a high-performance neural network forward computation framework specifically optimized for mobile platforms, designed to simplify the deployment of deep learning algorithms on mobile devices. The framework has no third-party dependencies and features cross-platform capabilities, with execution speeds on mobile CPUs that outperform all currently known open-source frameworks. Currently, ncnn is widely used in various mainstream applications under Tencent, helping developers easily build intelligent applications.
89
Tencent / ncnn
ncnn is a high-performance neural network forward computation framework deeply optimized for mobile platforms. The framework has no third-party dependencies and features cross-platform capabilities, outperforming all known open-source frameworks on mobile CPUs. Developers can easily port deep learning models to mobile devices using ncnn to build various intelligent applications.
87
alibaba / MNN
MNN is a high-performance, lightweight deep learning framework designed for efficient model inference and training on mobile and embedded devices. It supports a wide range of neural network architectures and provides versatile tools for model conversion, compression, and general-purpose computation. The framework is widely used in production environments, including various Alibaba applications, to enable device-cloud collaborative machine learning.
81
PaddlePaddle / FastDeploy
FastDeploy is an inference deployment toolkit for large language models and vision-language models based on PaddlePaddle, designed to provide out-of-the-box production-grade deployment solutions. This tool supports various mainstream hardware platforms and integrates load-balanced PD separation, unified KV cache transmission, and multiple advanced acceleration technologies. Developers can achieve rapid deployment through OpenAI API-compatible interfaces and optimize inference performance using full quantization format support.
71
alibaba / rtp-llm
RTP-LLM is a high-performance LLM inference acceleration engine developed by the Alibaba Foundation Model Inference team. This engine has been widely applied in various Alibaba business scenarios such as Taobao and Tmall, supporting multiple mainstream model formats and hardware backends. It provides efficient production-level services for large language models by integrating advanced operator optimization, quantization techniques, and distributed inference capabilities.
70
PaddlePaddle / FastDeploy
FastDeploy is an inference deployment toolkit for large language models and vision-language models based on PaddlePaddle, aiming to provide out-of-the-box production-grade deployment solutions. The toolkit supports various mainstream hardware platforms and integrates core technologies such as load-balanced PD separation, unified KV cache transmission, and full quantization format support. By being compatible with OpenAI API and vLLM interfaces, it helps developers efficiently implement model inference and online service deployment.
68
alibaba / rtp-llm
RTP-LLM is a high-performance large model inference acceleration engine developed by the Alibaba Foundation Model Inference Team, widely used in various business scenarios such as Taobao and Tmall. By integrating various advanced CUDA kernels and quantization techniques, the engine significantly improves model inference performance and efficiency. Furthermore, it possesses high flexibility, supporting multiple model formats, multimodal inputs, and LoRA service deployment.
68
toverainc / willow
The Willow Inference Server allows users to self-host high-speed language inference tasks for various applications. It supports essential features including speech-to-text, text-to-speech, and large language model processing. Users can access official documentation and community support through the project's website and GitHub discussions.
67
alibaba / tair-kvcache
Tair KVCache is an Alibaba Cloud system designed to accelerate Large Language Model inference through distributed memory pooling and dynamic multi-level caching. The project provides a centralized manager for global KVCache metadata and storage capacity, ensuring efficient data reliability and resource utilization. Additionally, it includes a high-fidelity simulation tool that allows developers to predict performance metrics without requiring actual GPU resources.
62
google-ai-edge / LiteRT-LM
LiteRT-LM is a high-performance, production-ready inference framework designed by Google for deploying Large Language Models on edge devices. It supports a wide range of platforms including Android, iOS, desktop, and IoT, while leveraging GPU and NPU hardware acceleration for optimal performance. The framework enables advanced capabilities such as multi-modality and function calling, powering on-device AI experiences in various Google products.
54
alexzhang13 / rlm
Recursive Language Models (RLMs) provide a task-agnostic inference paradigm that enables language models to handle near-infinite contexts through programmatic decomposition and recursive self-calling. The framework replaces standard completion calls with an RLM-specific interface that offloads context into a REPL environment for interactive execution. This repository offers an extensible engine supporting various local and cloud-based sandbox environments to facilitate complex, multi-step language model reasoning.
49
mnfst / awesome-free-llm-apis
This repository provides a curated list of LLM API providers that offer permanent free tiers for text inference. It categorizes services into direct provider APIs and third-party inference platforms, detailing model capabilities, context windows, and rate limits. The collection serves as a comprehensive resource for developers seeking cost-effective access to various large language models.
43
baidu / vLLM-Kunlun
vLLM Kunlun is a community-maintained hardware plugin that enables the seamless execution of vLLM on Kunlun XPU devices. It functions as a hardware-pluggable interface, allowing users to run various large language and multimodal models without modifying the original vLLM source code. The project supports advanced features like quantization, LoRA fine-tuning, and hardware-accelerated graph optimization to ensure high-performance inference.
40
Michael-A-Kuykendall / shimmy
Shimmy is a lightweight, single-binary server that provides a 100% OpenAI-compatible API for running GGUF models locally. It features zero-configuration model discovery, automatic GPU backend detection, and advanced CPU/GPU hybrid processing for large models. Designed for privacy and performance, it allows developers to integrate local LLMs into existing tools without code changes.
37

// use cases by project

ncnn
  • 01Supports a variety of mainstream CNN models, including classification, detection, segmentation, and face recognition algorithms.
  • 02Provides cross-platform deployment capabilities, supporting environments such as Android, iOS, Windows, Linux, macOS, and WebAssembly.
  • 03Helps developers port deep learning algorithms to mobile devices through efficient implementation, enabling the rapid deployment of artificial intelligence applications.
ncnn
  • 01Efficiently deploy deep learning algorithm models on mobile devices
  • 02Support mainstream CNN networks such as YOLO, MobileNet, and ResNet
  • 03Achieve high-performance cross-platform neural network inference computation
MNN
  • 01On-device inference and training for mobile and embedded platforms
  • 02Large language model (LLM) and stable diffusion model deployment
  • 03Model conversion and optimization from frameworks like TensorFlow, ONNX, and PyTorch
FastDeploy
  • 01Load-balanced PD separation and dynamic instance role switching
  • 02Compatibility with OpenAI API interfaces and the vLLM ecosystem
  • 03High-performance inference and full quantization support for multi-hardware platforms
rtp-llm
  • 01Supports various quantization techniques (INT8/INT4) and high-performance operator optimization to increase inference speed.
  • 02Provides flexible features such as multi-LoRA service deployment, multimodal input processing, and tensor parallelism.
  • 03Equipped with advanced acceleration technologies like context prefix caching and speculative sampling to optimize multi-turn conversation performance.

// comparisons

// related topics