// topic

Computer Vision

16 trending in last 90 days · 16 all-time

// new this month

// ecosystem

AI 16

// recent newcomers

#1 Thinking with Visual Primitives: Grounded Structural Reasoning · 🆕 2d ago · ↗ 127.35/d · ★ 213
#2 Mano-P: GUI-Aware Private AI Agent for Edge Devices · 🆕 1mo ago · ↗ 99.23/d · ★ 1,264
#3 Xiaomi Miloco Local Smart Home Copilot · 🆕 6mo ago · ↗ 68.06/d · ★ 2,549
#4 ERNIE-Image: High-Performance Open-Source Text-to-Image Diffusion Model · 🆕 18d ago · ↗ 53.67/d · ★ 412

// this week's top 8

deepseek-ai / Thinking-with-Visual-Primitives

Thinking with Visual Primitives introduces a novel approach to Multimodal Large Language Models by interleaving spatial markers directly into the reasoning process. This method addresses the reference gap in complex structural tasks by anchoring abstract language to concrete physical coordinates. The framework achieves frontier-competitive performance while maintaining high visual token efficiency through a compressed architecture.

Mininglamp-AI / Mano-P

Mano-P is a GUI-VLA agent project designed to enable autonomous, private task execution on edge devices like Mac mini and MacBook. It utilizes advanced reinforcement learning and edge-native inference to perform complex GUI automation, cross-system data integration, and long-task planning. The project provides a secure, local-first solution that eliminates the need for cloud API calls while maintaining high performance across various benchmarks.

XiaoMi / xiaomi-miloco

Xiaomi Miloco is an open-source smart home solution that utilizes on-device large language models to integrate and control IoT devices. By leveraging camera data streams, the system enables natural language interaction for complex home automation and event analysis. It prioritizes user privacy by performing visual understanding and task planning locally on the user's hardware.

baidu / ERNIE-Image

ERNIE-Image is an open-source text-to-image model developed by Baidu based on the Diffusion Transformer (DiT) architecture. The model is equipped with a lightweight prompt enhancer that transforms short inputs into structure-rich descriptions, achieving industry-leading generation results at an 8B parameter scale. It excels at handling complex text rendering, multi-object layout, and instruction-following tasks, while supporting efficient deployment on consumer-grade GPUs.

bilibili / Index-anisora

Index-AniSora is a powerful open-source framework designed specifically for high-quality anime video generation and animation production. The system features a comprehensive data processing pipeline, a controllable generation model with spatiotemporal masking, and a specialized evaluation benchmark. It supports diverse creative tasks including character 3D generation, video style transfer, and multimodal guidance for precise motion control.

Cua provides a unified ecosystem for building, benchmarking, and deploying autonomous agents capable of interacting with computer interfaces. The platform includes specialized tools for background macOS automation, cross-platform sandboxing, and high-performance virtualization. Developers can leverage these components to create agents that perform tasks, execute code, and navigate complex GUI environments seamlessly.

nikopueringer / CorridorKey

CorridorKey is a neural network-based tool designed to solve the complex problem of unmixing foreground subjects from green or blue screen backgrounds. It reconstructs the true straight color and linear alpha channel for every pixel, effectively preserving fine details like hair and motion blur. The project supports high-fidelity VFX workflows by outputting 16-bit and 32-bit Linear float EXR files compatible with industry-standard compositing software.

Anil-matcha / Open-Generative-AI

Open Generative AI is a free, open-source platform providing an unrestricted alternative to commercial AI media tools. It supports over 200 state-of-the-art models for image, video, and lip-sync generation without content filters or subscription fees. Users can access these capabilities through a web-based interface or a desktop application that supports both local and remote inference.

// all-time featured (16)

PaddlePaddle / PaddleOCR

PaddleOCR is a comprehensive toolkit designed to convert images and PDF documents into structured, LLM-ready data formats like Markdown and JSON. It features state-of-the-art vision-language models and high-performance text recognition engines that support over 100 languages. The platform is widely integrated into major AI agent and RAG frameworks, offering efficient deployment options across various hardware backends.

ncnn is a high-performance neural network inference framework optimized for mobile platforms and designed to simplify deploying deep learning models on mobile devices. It has no third-party dependencies, runs cross-platform, and, according to the project, is faster on mobile CPUs than all known open-source frameworks. ncnn is widely used in Tencent applications, where it helps developers build intelligent applications.

ncnn is a high-performance neural network inference framework deeply optimized for mobile platforms. It has no third-party dependencies, runs cross-platform, and, according to the project, outperforms all known open-source frameworks on mobile CPUs. Developers can use ncnn to port deep learning models to mobile devices and build intelligent applications.

deepseek-ai / Thinking-with-Visual-Primitives

Thinking with Visual Primitives introduces a novel approach to Multimodal Large Language Models by interleaving spatial markers directly into the reasoning process. This method addresses the reference gap in complex structural tasks by anchoring abstract language to concrete physical coordinates. The framework achieves frontier-competitive performance while maintaining high visual token efficiency through a compressed architecture.

Mininglamp-AI / Mano-P

Mano-P is a GUI-VLA agent project designed to enable autonomous, private task execution on edge devices like Mac mini and MacBook. It utilizes advanced reinforcement learning and edge-native inference to perform complex GUI automation, cross-system data integration, and long-task planning. The project provides a secure, local-first solution that eliminates the need for cloud API calls while maintaining high performance across various benchmarks.

MNN is a high-performance, lightweight deep learning framework designed for efficient model inference and training on mobile and embedded devices. It supports a wide range of neural network architectures and provides versatile tools for model conversion, compression, and general-purpose computation. The framework is widely used in production environments, including various Alibaba applications, to enable device-cloud collaborative machine learning.

XiaoMi / xiaomi-miloco

Xiaomi Miloco is an open-source smart home solution that utilizes on-device large language models to integrate and control IoT devices. By leveraging camera data streams, the system enables natural language interaction for complex home automation and event analysis. It prioritizes user privacy by performing visual understanding and task planning locally on the user's hardware.

PaddlePaddle / PaddleX

PaddleX 3.0 is a low-code development tool built on the PaddlePaddle framework that bundles a large set of ready-to-use pre-trained models and supports the full development process. Through a minimalist Python API and a graphical interface, it takes a model from training to inference deployment quickly. It is also compatible with a wide range of mainstream domestic and international hardware, helping developers bring models into industrial use efficiently.

baidu / ERNIE-Image

ERNIE-Image is an open-source text-to-image model developed by Baidu based on the Diffusion Transformer (DiT) architecture. The model is equipped with a lightweight prompt enhancer that transforms short inputs into structure-rich descriptions, achieving industry-leading generation results at an 8B parameter scale. It excels at handling complex text rendering, multi-object layout, and instruction-following tasks, while supporting efficient deployment on consumer-grade GPUs.

bilibili / Index-anisora

Index-AniSora is a powerful open-source framework designed specifically for high-quality anime video generation and animation production. The system features a comprehensive data processing pipeline, a controllable generation model with spatiotemporal masking, and a specialized evaluation benchmark. It supports diverse creative tasks including character 3D generation, video style transfer, and multimodal guidance for precise motion control.

bilibili / Index-anisora

Index-AniSora is a comprehensive open-source system developed by Bilibili for high-quality anime video generation. The project provides a controllable generation model, a specialized data processing pipeline, and an evaluation benchmark tailored for animation aesthetics. It supports advanced features such as character 3D video generation, video style transfer, and multimodal guidance to facilitate diverse animation production tasks.

XiaoMi / xiaomi-miloco

Xiaomi Miloco is an open-source exploratory solution that pairs Xiaomi Home cameras with a self-developed LLM to control IoT devices. It uses an on-device model to process visual data for scene understanding while protecting user privacy and security. Users can define complex home rules and interact with their smart home ecosystem in natural language.

Cua provides a unified ecosystem for building, benchmarking, and deploying autonomous agents capable of interacting with computer interfaces. The platform includes specialized tools for background macOS automation, cross-platform sandboxing, and high-performance virtualization. Developers can leverage these components to create agents that perform tasks, execute code, and navigate complex GUI environments seamlessly.

jd-opensource / JoyAI-Image

JoyAI-Image is a unified multimodal foundation model that integrates an 8B Multimodal Large Language Model with a 16B Multimodal Diffusion Transformer to support image understanding, generation, and editing. The model utilizes a closed-loop collaboration between understanding and generation to enhance spatial reasoning and controllable editing capabilities. It provides a scalable training pipeline and supports advanced features like multi-view generation and precise spatial manipulation.

nikopueringer / CorridorKey

CorridorKey is a neural network-based tool designed to solve the complex problem of unmixing foreground subjects from green or blue screen backgrounds. It reconstructs the true straight color and linear alpha channel for every pixel, effectively preserving fine details like hair and motion blur. The project supports high-fidelity VFX workflows by outputting 16-bit and 32-bit Linear float EXR files compatible with industry-standard compositing software.

Anil-matcha / Open-Generative-AI

Open Generative AI is a free, open-source platform providing an unrestricted alternative to commercial AI media tools. It supports over 200 state-of-the-art models for image, video, and lip-sync generation without content filters or subscription fees. Users can access these capabilities through a web-based interface or a desktop application that supports both local and remote inference.

// use cases by project

01 Intelligent document parsing for LLM-ready structured data extraction
02 Universal multilingual text recognition for natural scene and document analysis
03 Building high-quality datasets for fine-tuning Large Language Models

01 Supports a variety of mainstream CNN models, including classification, detection, segmentation, and face recognition algorithms.
02 Provides cross-platform deployment capabilities, supporting environments such as Android, iOS, Windows, Linux, macOS, and WebAssembly.
03 Helps developers port deep learning algorithms to mobile devices through efficient implementation, enabling the rapid deployment of artificial intelligence applications.

01 Efficiently deploy deep learning algorithm models on mobile devices
02 Support mainstream CNN networks such as YOLO, MobileNet, and ResNet
03 Achieve high-performance cross-platform neural network inference computation

Thinking-with-Visual-Primitives

01 Grounded task reasoning using spatial markers
02 Complex topological reasoning in visual environments
03 Efficient visual processing with reduced token consumption

01 Complex GUI automation for autonomous interface operations
02 End-to-end autonomous software construction pipelines
03 Private, local-side business process and task execution

// comparisons

PaddleOCR vs FlashMLA · ncnn vs ncnn · ncnn vs MNN · FastDeploy vs ncnn

// related topics

Deep Learning (7) · LLM (6) · Generative AI (5) · Machine Learning (3) · Video Generation (3)