// topic

Computer Vision

16 trending in last 90 days · 16 all-time

// ecosystem

Deep Learning 7 · LLM 6 · Generative AI 5 · Machine Learning 3 · Video Generation 3 · Computer Vision
AI 16

// this week's top 8

01
deepseek-ai / Thinking-with-Visual-Primitives
Thinking with Visual Primitives introduces a novel approach to Multimodal Large Language Models by interleaving spatial markers directly into the reasoning process. This method addresses the reference gap in complex structural tasks by anchoring abstract language to concrete physical coordinates. The framework achieves frontier-competitive performance while maintaining high visual token efficiency through a compressed architecture.
84 · 213
02
Mininglamp-AI / Mano-P
Mano-P is a GUI-VLA agent project designed to enable autonomous, private task execution on edge devices like Mac mini and MacBook. It utilizes advanced reinforcement learning and edge-native inference to perform complex GUI automation, cross-system data integration, and long-task planning. The project provides a secure, local-first solution that eliminates the need for cloud API calls while maintaining high performance across various benchmarks.
83 · 1,264
03
XiaoMi / xiaomi-miloco
Xiaomi Miloco is an open-source smart home solution that utilizes on-device large language models to integrate and control IoT devices. By leveraging camera data streams, the system enables natural language interaction for complex home automation and event analysis. It prioritizes user privacy by performing visual understanding and task planning locally on the user's hardware.
74 · 2,549
04
baidu / ERNIE-Image
ERNIE-Image is an open-source text-to-image model developed by Baidu based on the Diffusion Transformer (DiT) architecture. The model is equipped with a lightweight prompt enhancer that transforms short inputs into structure-rich descriptions, achieving industry-leading generation results at an 8B parameter scale. It excels at handling complex text rendering, multi-object layout, and instruction-following tasks, while supporting efficient deployment on consumer-grade GPUs.
71 · 412
05
bilibili / Index-anisora
Index-AniSora is a powerful open-source framework designed specifically for high-quality anime video generation and animation production. The system features a comprehensive data processing pipeline, a controllable generation model with spatiotemporal masking, and a specialized evaluation benchmark. It supports diverse creative tasks including character 3D generation, video style transfer, and multimodal guidance for precise motion control.
68 · 2,421
06
trycua / cua
Cua provides a unified ecosystem for building, benchmarking, and deploying autonomous agents capable of interacting with computer interfaces. The platform includes specialized tools for background macOS automation, cross-platform sandboxing, and high-performance virtualization. Developers can leverage these components to create agents that perform tasks, execute code, and navigate complex GUI environments seamlessly.
55 · 103
07
nikopueringer / CorridorKey
CorridorKey is a neural network-based tool designed to solve the complex problem of unmixing foreground subjects from green or blue screen backgrounds. It reconstructs the true straight color and linear alpha channel for every pixel, effectively preserving fine details like hair and motion blur. The project supports high-fidelity VFX workflows by outputting 16-bit and 32-bit linear float EXR files compatible with industry-standard compositing software.
42 · 23
08
Anil-matcha / Open-Generative-AI
Open Generative AI is a free, open-source platform providing an unrestricted alternative to commercial AI media tools. It supports over 200 state-of-the-art models for image, video, and lip-sync generation without content filters or subscription fees. Users can access these capabilities through a web-based interface or a desktop application that supports both local and remote inference.
39 · 129

// all-time featured (16)

PaddlePaddle / PaddleOCR
PaddleOCR is a comprehensive toolkit designed to convert images and PDF documents into structured, LLM-ready data formats like Markdown and JSON. It features state-of-the-art vision-language models and high-performance text recognition engines that support over 100 languages. The platform is widely integrated into major AI agent and RAG frameworks, offering efficient deployment options across various hardware backends.
89
Tencent / ncnn
ncnn is a high-performance neural network forward computation framework specifically optimized for mobile platforms, designed to simplify the deployment of deep learning algorithms on mobile devices. The framework has no third-party dependencies, is cross-platform, and claims faster execution on mobile CPUs than any other known open-source framework. ncnn is currently used in many mainstream Tencent applications, where it helps developers build intelligent applications.
89
Tencent / ncnn
ncnn is a high-performance neural network forward computation framework deeply optimized for mobile platforms. The framework has no third-party dependencies and features cross-platform capabilities, outperforming all known open-source frameworks on mobile CPUs. Developers can easily port deep learning models to mobile devices using ncnn to build various intelligent applications.
87
deepseek-ai / Thinking-with-Visual-Primitives
Thinking with Visual Primitives introduces a novel approach to Multimodal Large Language Models by interleaving spatial markers directly into the reasoning process. This method addresses the reference gap in complex structural tasks by anchoring abstract language to concrete physical coordinates. The framework achieves frontier-competitive performance while maintaining high visual token efficiency through a compressed architecture.
84
Mininglamp-AI / Mano-P
Mano-P is a GUI-VLA agent project designed to enable autonomous, private task execution on edge devices like Mac mini and MacBook. It utilizes advanced reinforcement learning and edge-native inference to perform complex GUI automation, cross-system data integration, and long-task planning. The project provides a secure, local-first solution that eliminates the need for cloud API calls while maintaining high performance across various benchmarks.
83
alibaba / MNN
MNN is a high-performance, lightweight deep learning framework designed for efficient model inference and training on mobile and embedded devices. It supports a wide range of neural network architectures and provides versatile tools for model conversion, compression, and general-purpose computation. The framework is widely used in production environments, including various Alibaba applications, to enable device-cloud collaborative machine learning.
81
XiaoMi / xiaomi-miloco
Xiaomi Miloco is an open-source smart home solution that utilizes on-device large language models to integrate and control IoT devices. By leveraging camera data streams, the system enables natural language interaction for complex home automation and event analysis. It prioritizes user privacy by performing visual understanding and task planning locally on the user's hardware.
74
PaddlePaddle / PaddleX
PaddleX 3.0 is a low-code development tool built on the PaddlePaddle framework that integrates a large set of out-of-the-box pre-trained models to support the full development process. Through a minimalist Python API and a graphical interface, the tool enables rapid progress from model training to inference deployment. It is also compatible with mainstream domestic and international hardware, helping developers put models into industrial use efficiently.
72
baidu / ERNIE-Image
ERNIE-Image is an open-source text-to-image model developed by Baidu based on the Diffusion Transformer (DiT) architecture. The model is equipped with a lightweight prompt enhancer that transforms short inputs into structure-rich descriptions, achieving industry-leading generation results at an 8B parameter scale. It excels at handling complex text rendering, multi-object layout, and instruction-following tasks, while supporting efficient deployment on consumer-grade GPUs.
71
bilibili / Index-anisora
Index-AniSora is a powerful open-source framework designed specifically for high-quality anime video generation and animation production. The system features a comprehensive data processing pipeline, a controllable generation model with spatiotemporal masking, and a specialized evaluation benchmark. It supports diverse creative tasks including character 3D generation, video style transfer, and multimodal guidance for precise motion control.
68
bilibili / Index-anisora
Index-AniSora is a comprehensive open-source system developed by Bilibili for high-quality anime video generation. The project provides a controllable generation model, a specialized data processing pipeline, and an evaluation benchmark tailored for animation aesthetics. It supports advanced features such as character 3D video generation, video style transfer, and multimodal guidance to facilitate diverse animation production tasks.
61
XiaoMi / xiaomi-miloco
Xiaomi Miloco is an open-source exploratory project that integrates Xiaomi Home cameras with a self-developed LLM to control IoT devices. It utilizes an on-device model to process visual data for scene understanding while ensuring user privacy and security. Users can define complex home rules and interact with their smart ecosystem using natural language.
57
trycua / cua
Cua provides a unified ecosystem for building, benchmarking, and deploying autonomous agents capable of interacting with computer interfaces. The platform includes specialized tools for background macOS automation, cross-platform sandboxing, and high-performance virtualization. Developers can leverage these components to create agents that perform tasks, execute code, and navigate complex GUI environments seamlessly.
55
jd-opensource / JoyAI-Image
JoyAI-Image is a unified multimodal foundation model that integrates an 8B Multimodal Large Language Model with a 16B Multimodal Diffusion Transformer to support image understanding, generation, and editing. The model utilizes a closed-loop collaboration between understanding and generation to enhance spatial reasoning and controllable editing capabilities. It provides a scalable training pipeline and supports advanced features like multi-view generation and precise spatial manipulation.
52
nikopueringer / CorridorKey
CorridorKey is a neural network-based tool designed to solve the complex problem of unmixing foreground subjects from green or blue screen backgrounds. It reconstructs the true straight color and linear alpha channel for every pixel, effectively preserving fine details like hair and motion blur. The project supports high-fidelity VFX workflows by outputting 16-bit and 32-bit linear float EXR files compatible with industry-standard compositing software.
42
Anil-matcha / Open-Generative-AI
Open Generative AI is a free, open-source platform providing an unrestricted alternative to commercial AI media tools. It supports over 200 state-of-the-art models for image, video, and lip-sync generation without content filters or subscription fees. Users can access these capabilities through a web-based interface or a desktop application that supports both local and remote inference.
39

// use cases by project

PaddleOCR
  • 01 Intelligent document parsing for LLM-ready structured data extraction
  • 02 Universal multilingual text recognition for natural scene and document analysis
  • 03 Building high-quality datasets for fine-tuning Large Language Models
ncnn
  • 01 Supports a variety of mainstream CNN models, including classification, detection, segmentation, and face recognition algorithms.
  • 02 Provides cross-platform deployment capabilities, supporting environments such as Android, iOS, Windows, Linux, macOS, and WebAssembly.
  • 03 Helps developers port deep learning algorithms to mobile devices through efficient implementation, enabling the rapid deployment of artificial intelligence applications.
ncnn
  • 01 Efficiently deploy deep learning algorithm models on mobile devices
  • 02 Support mainstream CNN networks such as YOLO, MobileNet, and ResNet
  • 03 Achieve high-performance cross-platform neural network inference computation
Thinking-with-Visual-Primitives
  • 01 Grounded task reasoning using spatial markers
  • 02 Complex topological reasoning in visual environments
  • 03 Efficient visual processing with reduced token consumption
Mano-P
  • 01 Complex GUI automation for autonomous interface operations
  • 02 End-to-end autonomous software construction pipelines
  • 03 Private, local-side business process and task execution
