HunyuanDiT
HunyuanDiT-v1.2
Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
This repo contains PyTorch model definitions, pre-trained weights and inference/sampling code for our paper exploring Hunyuan-DiT. You can find more visualizations on our project page.
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation
🔥🔥🔥 News!!
- Jul 03, 2024: 🎉 Kohya-hydit version now available for v1.1 and v1.2 models, with GUI for inference. Official Kohya version is under review. See kohya for details.
- Jun 27, 2024: 🎨 Hunyuan-Captioner is released, providing fine-grained captions for training data. See mllm for details.
- Jun 27, 2024: 🎉 LoRA and ControlNet are now supported in diffusers. See diffusers for details.
- Jun 27, 2024: 🎉 Inference scripts for GPUs with 6GB of VRAM are released. See lite for details.
- Jun 19, 2024: 🎉 ControlNet is released, supporting canny, pose and depth control. See the training/inference code for details.
- Jun 13, 2024: ⚡ HYDiT-v1.1 version is released, which mitigates the issue of image oversaturation and alleviates the watermark issue. Please check HunyuanDiT-v1.1 and Distillation-v1.1 for more details.
- Jun 13, 2024: 🚚 The training code is released, offering full-parameter training and LoRA training.
- Jun 06, 2024: 🎉 Hunyuan-DiT is now available in ComfyUI. Please check ComfyUI for more details.
- Jun 06, 2024: 🚀 We introduce a distillation version for Hunyuan-DiT acceleration, which achieves 50% acceleration on NVIDIA GPUs. Please check Distillation for more details.
- Jun 05, 2024: 🤗 Hunyuan-DiT is now available in 🤗 Diffusers! Please check the example below.
- Jun 04, 2024: 🌐 Tencent Cloud links are now available for downloading the pretrained models! Please check the links below.
- May 22, 2024: 🚀 We introduce a TensorRT version for Hunyuan-DiT acceleration, which achieves 47% acceleration on NVIDIA GPUs. Please check TensorRT-libs for instructions.
- May 22, 2024: 💬 We now support a demo for multi-turn text2image generation. Please check the script below.
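The Jun 05 news item mentions 🤗 Diffusers support. A minimal sketch of that usage follows, assuming a recent diffusers release that ships `HunyuanDiTPipeline`, a CUDA GPU, and that the weights are published under the Hub id `Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers` (the id is an assumption here; substitute the repo you actually use). The helper name `generate_image` is hypothetical and only wraps the standard pipeline calls.

```python
# Assumed Hub repo id for the v1.2 Diffusers weights; adjust if yours differs.
DEFAULT_MODEL_ID = "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers"


def generate_image(prompt, model_id=DEFAULT_MODEL_ID, device="cuda"):
    """Load the pipeline and return one PIL image for `prompt`.

    Hypothetical helper: it only wraps `from_pretrained` and a pipeline call.
    """
    # Imported lazily so this module can be imported without torch/diffusers.
    import torch
    from diffusers import HunyuanDiTPipeline

    pipe = HunyuanDiTPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe.to(device)
    # The pipeline accepts English or Chinese prompts.
    return pipe(prompt).images[0]


# Example usage (downloads the weights on first call, needs a GPU):
#   image = generate_image("一个宇航员在骑马")  # "An astronaut riding a horse"
#   image.save("astronaut.png")
```

Loading the pipeline inside the function keeps the example importable on machines without a GPU stack; in a real application you would likely load it once and reuse it across prompts.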
We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully designed the transformer structure, text encoder, and positional encoding. We also built a complete data pipeline from scratch to update and evaluate data for iterative model optimization. For fine-grained language understanding, we trained a Multimodal Large Language Model (MLLM) to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-round multi-modal dialogue with users, generating and refining images according to the context. Under our carefully designed holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state of the art in Chinese-to-image generation compared with other open-source models.
Open-source Plan
Hunyuan-DiT Key Features
Chinese-English Bilingual DiT Architecture
Hunyuan-DiT is a diffusion model in the latent space, as depicted in the figure below. Following the Latent Diffusion Model, we use a pre-trained Variational Autoencoder (VAE) to compress images into a low-dimensional latent space and train a diffusion model to learn the data distribution in that space. Our diffusion model is parameterized with a transformer. To encode the text prompts, we leverage a combination of a pre-trained bilingual (English and Chinese) CLIP encoder and a multilingual T5 encoder.
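To make the latent-space setup concrete, here is a small sketch of how image resolution maps to latent shape and transformer sequence length. The numbers are assumptions for illustration, not values stated above: an 8x spatial downsampling VAE with 4 latent channels, and a transformer patch size of 2. `LatentConfig`, `latent_shape`, and `num_tokens` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentConfig:
    vae_downsample: int = 8   # assumed VAE spatial compression factor
    latent_channels: int = 4  # assumed number of latent channels
    patch_size: int = 2       # assumed transformer patch size


def latent_shape(height, width, cfg=LatentConfig()):
    """Return (channels, latent_height, latent_width) for an input image."""
    if height % cfg.vae_downsample or width % cfg.vae_downsample:
        raise ValueError("image size must be divisible by the VAE factor")
    return (cfg.latent_channels,
            height // cfg.vae_downsample,
            width // cfg.vae_downsample)


def num_tokens(height, width, cfg=LatentConfig()):
    """Return how many patch tokens the transformer sees for this image."""
    _, lat_h, lat_w = latent_shape(height, width, cfg)
    if lat_h % cfg.patch_size or lat_w % cfg.patch_size:
        raise ValueError("latent size must be divisible by the patch size")
    return (lat_h // cfg.patch_size) * (lat_w // cfg.patch_size)


# Different aspect ratios give different sequence lengths.
for h, w in [(1024, 1024), (1280, 768), (768, 1280)]:
    print(f"{h}x{w}: latent={latent_shape(h, w)}, tokens={num_tokens(h, w)}")
```

Under these assumptions a 1024x1024 image becomes a 4x128x128 latent and 4096 tokens, which is why a multi-resolution model needs positional encodings that handle variable token grids.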
Multi-turn Text2Image Generation
Understanding natural language instructions and performing multi-turn interaction with users are important for a text-to-image system. They help build a dynamic, iterative creation process that brings the user's idea into reality step by step. In this section, we detail how we give Hunyuan-DiT the ability to perform multi-round conversations and image generation. We train an MLLM to understand the multi-round user dialogue and output a new text prompt for image generation.
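The control flow described above can be sketched as a simple loop: after each user message, a prompt rewriter reads the whole dialogue and emits a fresh text prompt, which is then passed to the image generator. Everything here is illustrative: `run_dialogue`, `toy_rewrite`, and `toy_generate` are hypothetical stand-ins, and the stubs replace the real MLLM and Hunyuan-DiT calls.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text), role is "user" or "assistant"


def run_dialogue(
    user_messages: List[str],
    rewrite_prompt: Callable[[List[Turn]], str],
    generate_image: Callable[[str], object],
) -> List[Tuple[str, object]]:
    """Run a multi-turn session and return (prompt, image) per turn."""
    history: List[Turn] = []
    results = []
    for message in user_messages:
        history.append(("user", message))
        # The rewriter sees the full history, so later turns can refine
        # earlier ones ("make it orange") without restating everything.
        prompt = rewrite_prompt(history)
        image = generate_image(prompt)
        history.append(("assistant", prompt))
        results.append((prompt, image))
    return results


def toy_rewrite(history: List[Turn]) -> str:
    """Stub for the MLLM: joins all user turns into one prompt."""
    return ", ".join(text for role, text in history if role == "user")


def toy_generate(prompt: str) -> str:
    """Stub for the text-to-image model: returns a placeholder string."""
    return f"<image for: {prompt}>"


results = run_dialogue(["a cat", "make it orange"], toy_rewrite, toy_generate)
for prompt, image in results:
    print(prompt, "->", image)
```

Passing the rewriter and generator as callables keeps the loop independent of any particular model, so the stubs can be swapped for real components without changing the dialogue logic.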
Comparative analysis
To comprehensively compare the generation capabilities of HunyuanDiT and other models, we constructed a test set covering four dimensions: Text-Image Consistency, Excluding AI Artifacts, Subject Clarity, and Aesthetics. More than 50 professional evaluators performed the evaluation.
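As a rough illustration of how per-dimension scores could be aggregated from evaluator judgments, the sketch below computes a pass rate for each of the four dimensions. The aggregation rule (percentage of pass votes) and the votes themselves are assumptions made up for this example; they are not the protocol or the results of the actual evaluation.

```python
from collections import defaultdict

DIMENSIONS = (
    "Text-Image Consistency",
    "Excluding AI Artifacts",
    "Subject Clarity",
    "Aesthetics",
)


def pass_rates(votes):
    """Return {dimension: pass percentage} from (dimension, passed) votes.

    Assumed rule: each vote is a binary pass/fail judgment on one image
    along one dimension; dimensions without votes are omitted.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for dimension, ok in votes:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        totals[dimension] += 1
        passed[dimension] += int(ok)
    return {
        dim: 100.0 * passed[dim] / totals[dim]
        for dim in DIMENSIONS
        if totals[dim]
    }


# Made-up toy votes, for illustration only.
toy_votes = [
    ("Text-Image Consistency", True),
    ("Text-Image Consistency", False),
    ("Aesthetics", True),
]
print(pass_rates(toy_votes))
```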
Visualization
Long Text Input
Chinese Elements
GitHub: https://github.com/Tencent/HunyuanDiT
Paper: https://tencent.github.io/HunyuanDiT/asset/Hunyuan_DiT_Tech_Report_05140553.pdf
Data creation process: https://github.com/Tencent/HunyuanDiT/blob/main/IndexKits/docs/MakeDataset.md