// summary
ERNIE-Image is an open-source text-to-image model from Baidu, built on the Diffusion Transformer (DiT) architecture. It ships with a lightweight prompt enhancer that expands short inputs into richly structured descriptions, and the project reports industry-leading generation quality at the 8B-parameter scale. The model is strongest at complex text rendering, multi-object layout, and instruction following, and it can be deployed efficiently on consumer-grade GPUs.
// technical analysis
ERNIE-Image pairs a DiT-based generator with a lightweight Prompt Enhancer that rewrites short user inputs into structured descriptions before generation, which significantly improves the model's ability to follow complex instructions. Its core technical advantage is highly competitive performance from a compact 8B parameters. The model is also specifically optimized for text rendering and structured visual tasks, and it runs efficiently on consumer-grade GPUs with 24 GB of VRAM.
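As a rough sanity check on the 24 GB figure, the weight footprint can be estimated from the parameter count. This is a back-of-the-envelope sketch, not a measurement: it assumes the 8B figure refers to the DiT weights alone and that they are stored in bfloat16 (2 bytes per parameter). It ignores the text encoder, VAE, Prompt Enhancer, and activation memory, all of which add to the real requirement.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 8e9  # "8B" from the text; assumed to cover the DiT only

bf16_gib = weight_memory_gib(NUM_PARAMS, 2)  # bfloat16 / float16 storage
fp32_gib = weight_memory_gib(NUM_PARAMS, 4)  # full-precision storage

print(f"bf16 weights: {bf16_gib:.1f} GiB")
print(f"fp32 weights: {fp32_gib:.1f} GiB")
```

Under these assumptions the half-precision weights take roughly 15 GiB, which leaves headroom on a 24 GB card, while full-precision weights alone would not fit.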
// key highlights
// use cases
// getting started
Developers can run the model through the Hugging Face diffusers library: install the latest version of diffusers, then load the model with ErnieImagePipeline for inference. For production environments, the project provides an SGLang-based deployment that can serve the DiT model and the Prompt Enhancer separately to improve inference speed.
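The diffusers route might look like the following minimal sketch. Only the class name `ErnieImagePipeline` comes from the text above. The rest is assumed for illustration and should be checked against the project's model card: that the pipeline follows the usual diffusers conventions (`from_pretrained`, a `prompt` and `num_inference_steps` argument, an `.images` list on the result), the default step count, and the helper names `load_pipeline`, `generation_kwargs`, and `generate`. The model ID is left as a caller-supplied argument.

```python
import importlib


def load_pipeline(model_id: str, device: str = "cuda"):
    """Load the pipeline in bfloat16 and move it to the target device."""
    import torch  # deferred so this module imports without torch installed

    diffusers = importlib.import_module("diffusers")
    # ErnieImagePipeline is the class named in the text; older diffusers
    # releases may not include it, so fail with a clear message.
    pipeline_cls = getattr(diffusers, "ErnieImagePipeline", None)
    if pipeline_cls is None:
        raise RuntimeError("This diffusers version has no ErnieImagePipeline; upgrade diffusers.")
    pipe = pipeline_cls.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return pipe.to(device)


def generation_kwargs(prompt: str, steps: int = 30, seed=None) -> dict:
    """Build call arguments; the argument names follow common diffusers usage (assumed)."""
    kwargs = {"prompt": prompt, "num_inference_steps": steps}
    if seed is not None:
        import torch

        # A seeded generator makes the sampled image reproducible.
        kwargs["generator"] = torch.Generator("cpu").manual_seed(seed)
    return kwargs


def generate(model_id: str, prompt: str, out_path: str = "output.png") -> str:
    """Run one text-to-image generation and save the first image."""
    pipe = load_pipeline(model_id)
    result = pipe(**generation_kwargs(prompt))
    result.images[0].save(out_path)  # assumes the standard `.images` output field
    return out_path
```

Calling `generate("<model-id-from-the-model-card>", "A poster that reads 'Grand Opening'")` would then write `output.png`. Keeping the imports inside the functions lets the helper module be imported on machines without a GPU stack installed.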