baidu

ERNIE-Image

#Diffusion Transformer #Text-to-Image #Generative AI #Computer Vision

// summary

ERNIE-Image is an open-source text-to-image model developed by Baidu and built on the Diffusion Transformer (DiT) architecture. It ships with a lightweight prompt enhancer that expands short inputs into richly structured descriptions, and the project reports industry-leading generation quality at the 8B parameter scale. It handles complex text rendering, multi-object layout, and instruction-following tasks well, and it can be deployed efficiently on consumer-grade GPUs.

// technical analysis

ERNIE-Image pairs its DiT backbone with a lightweight Prompt Enhancer that rewrites short user inputs into structured descriptions, which improves the model's ability to follow complex instructions. Its core technical advantage is competitive performance at a compact 8B parameter scale. It is also optimized for text rendering and structured visual tasks, and it runs on consumer-grade GPUs with 24GB of VRAM.
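
As a rough sanity check on the 24GB figure, the memory needed for the weights alone can be estimated from the parameter count. The sketch below assumes 16-bit weights (bf16 or fp16) and counts only the 8B DiT parameters. The precision is an assumption, not something the project states here, and the text encoder, VAE, and activations need additional memory on top.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# 8B parameters at 2 bytes each (bf16/fp16, assumed) versus 4 bytes each (fp32).
print(f"16-bit: {weight_memory_gib(8e9, 2):.1f} GiB")  # 16-bit: 14.9 GiB
print(f"32-bit: {weight_memory_gib(8e9, 4):.1f} GiB")  # 32-bit: 29.8 GiB
```

Under these assumptions the 16-bit weights leave several GiB of headroom on a 24GB card, while 32-bit weights would not fit at all.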

// key highlights

Adopts a compact 8B-parameter DiT architecture, delivering generation quality comparable to much larger models while remaining lightweight.

Renders text accurately, including difficult visual content such as long passages, posters, and user interfaces.

Includes a built-in Prompt Enhancer module that can automatically expand simple prompts into high-quality structured descriptions.

Supports complex instruction following, accurately handling multi-object relationships, knowledge-intensive descriptions, and multi-panel composition tasks.

Provides an ERNIE-Image-Turbo variant that generates images in just 8 steps, achieved through DMD (Distribution Matching Distillation) and RL (reinforcement learning) optimization.

Widely compatible with the open-source ecosystem, supporting ComfyUI workflows, Unsloth GGUF builds, and AI-Toolkit fine-tuning.
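
The highlights above describe a two-stage flow: the Prompt Enhancer expands a short prompt, then the DiT generates an image from the expanded description. The sketch below illustrates only that composition. `TwoStagePipeline`, `fake_enhance`, and `fake_generate` are hypothetical names invented for this example, and the stubs stand in for the real models; the default of 8 steps mirrors the Turbo variant.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TwoStagePipeline:
    """Hypothetical wrapper: enhance the prompt first, then generate."""

    enhance: Callable[[str], str]          # short prompt -> structured description
    generate: Callable[[str, int], bytes]  # (description, steps) -> image bytes

    def run(self, prompt: str, steps: int = 8) -> bytes:
        description = self.enhance(prompt)
        return self.generate(description, steps)


# Stubs standing in for the real Prompt Enhancer and DiT (illustration only).
def fake_enhance(prompt: str) -> str:
    return f"Subject: {prompt}. Layout: centered. Text: none."


def fake_generate(description: str, steps: int) -> bytes:
    return f"{steps}|{description}".encode()


pipeline = TwoStagePipeline(enhance=fake_enhance, generate=fake_generate)
result = pipeline.run("red poster")
```

Keeping the two stages behind separate callables means either one can be swapped or scaled independently without touching the other.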

// use cases

High-quality poster and infographic generation

Multi-object and layout control under complex instructions

Multi-style image creation with fast inference

// getting started

Developers can run the model through the Hugging Face diffusers library: install the latest version of diffusers and load the model for inference with ErnieImagePipeline. For production environments, the project provides an SGLang-based deployment that serves the DiT model and the Prompt Enhancer separately to improve inference speed.
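
A minimal sketch of the diffusers route might look like the following. Only the class name `ErnieImagePipeline` comes from the text above. The repo id `baidu/ERNIE-Image`, the bf16 dtype, and the assumption that the pipeline follows the usual diffusers conventions (`from_pretrained`, a `num_inference_steps` argument, an `.images` list on the result) are unverified here, so check the model card before relying on them. The heavy imports sit inside `generate` so the helper can be used without a GPU.

```python
from typing import Any, Dict, Optional

# Assumption: Hugging Face repo id inferred from the project name.
MODEL_ID = "baidu/ERNIE-Image"


def generation_kwargs(prompt: str, steps: Optional[int] = None) -> Dict[str, Any]:
    """Build keyword arguments for a diffusers-style pipeline call."""
    kwargs: Dict[str, Any] = {"prompt": prompt}
    if steps is not None:
        # Usual diffusers argument name; assumed to apply to this pipeline.
        kwargs["num_inference_steps"] = steps
    return kwargs


def generate(prompt: str, steps: Optional[int] = None, out_path: str = "out.png") -> str:
    """Load the pipeline and save one image (needs a GPU and a large download)."""
    import torch
    from diffusers import ErnieImagePipeline  # class name as given in the text

    pipe = ErnieImagePipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    pipe.to("cuda")
    image = pipe(**generation_kwargs(prompt, steps)).images[0]
    image.save(out_path)
    return out_path
```

Calling `generate("a poster that reads 'Grand Opening'")` would then write `out.png`, provided the assumptions above hold.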