Diffusers 0.34.0: New Image and Video Models, Better torch.compile Support, and more - diffusers Release Notes

📹 New video generation pipelines

Wan VACE

Wan VACE supports various generation techniques which achieve controllable video generation. It comes in two variants: a 1.3B model for fast iteration & prototyping, and a 14B for high quality generation. Some of the capabilities include:

Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Boundary Box, etc.). Recommended library for preprocessing videos to obtain control videos: huggingface/controlnet_aux
Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
Inpainting and Outpainting
Subject to Video (faces, object, characters, etc.)
Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)

The code snippets available in this pull request demonstrate some examples of how videos can be generated with controllability signals.

Check out the docs to learn more.

Cosmos Predict2 Video2World

Cosmos-Predict2 is a key branch of the Cosmos World Foundation Models (WFMs) ecosystem for Physical AI, specializing in future state prediction through advanced world modeling. It offers two powerful capabilities: text-to-image generation for creating high-quality images from text descriptions, and video-to-world generation for producing visual simulations from video inputs.

The Video2World model comes in a 2B and 14B variant. Check out the docs to learn more.

LTX 0.9.7 and Distilled

LTX 0.9.7 and its distilled variants are the latest in the family of models released by Lightricks.

Check out the docs to learn more.

Hunyuan Video Framepack and F1

Framepack is a novel method for enabling long video generation. There are two released variants of Hunyuan Video trained using this technique. Check out the docs to learn more.

FusionX

The FusionX family of models and LoRAs, built on top of Wan2.1-14B, should already be supported. To load the model, use from_single_file():

from diffusers import WanTransformer3DModel

transformer = WanTransformer3DModel.from_single_file(
    "https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/Wan14Bi2vFusioniX_fp16.safetensors",
    torch_dtype=torch.bfloat16
)

To load the LoRAs, use load_lora_weights():

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
    "vrgamedevgirl84/Wan14BT2VFusioniX", weight_name="FusionX_LoRa/Wan2.1_T2V_14B_FusionX_LoRA.safetensors"
)

AccVideo and CausVid (only LoRAs)

AccVideo and CausVid are two novel distillation techniques that speed up the generation time of video diffusion models while preserving quality. Diffusers supports loading their extracted LoRAs with their respective models.

🌠 New image generation pipelines

Cosmos Predict2 Text2Image

Text-to-image models from the Cosmos-Predict2 release. The models comes in a 2B and 14B variant. Check out the docs to learn more.

Chroma

Chroma is a 8.9B parameter model based on FLUX.1-schnell. It’s fully Apache 2.0 licensed, ensuring that anyone can use, modify, and build on top of it. Checkout the docs to learn more

Thanks to @Ednaordinary for contributing it in this PR!

VisualCloze

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning is an innovative in-context learning framework based universal image generation framework that offers key capabilities:

Support for various in-domain tasks
Generalization to unseen tasks through in-context learning
Unify multiple tasks into one step and generate both target image and intermediate results
Support reverse-engineering conditions from target images

Check out the docs to learn more. Thanks to @lzyhha for contributing this in this PR!

Better `torch.compile` support

We have worked with the PyTorch team to improve how we provide torch.compile() compatibility throughout the library. More specifically, we now test the widely used models like Flux for any recompilation and graph break issues which can get in the way of fully realizing torch.compile() benefits. Refer to the following links to learn more:

https://github.com/huggingface/diffusers/pull/11085
https://github.com/huggingface/diffusers/issues/11430

Additionally, users can combine offloading with compilation to get a better speed-memory trade-off. Below is an example:

Code

import torch
from diffusers import DiffusionPipeline
torch._dynamo.config.cache_size_limit = 10000

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipline.enable_model_cpu_offload()
# Compile.
pipeline.transformer.compile()

image = pipeline(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=0.,
    height=768,
    width=1360,
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

This is compatible with group offloading, too. Interested readers can check out the concerned PRs below:

https://github.com/huggingface/diffusers/pull/11605
https://github.com/huggingface/diffusers/pull/11670

You can substantially reduce memory requirements by combining quantization with offloading and then improving speed with torch.compile(). Below is an example:

Code

from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel

import torch
torch._dynamo.config.recompile_limit = 1000 

quant_kwargs = {"load_in_4bit": True, "bnb_4bit_compute_dtype": torch_dtype, "bnb_4bit_quant_type": "nf4"}
text_encoder_2_quant_config = TransformersBitsAndBytesConfig(**quant_kwargs)
dit_quant_config = DiffusersBitsAndBytesConfig(**quant_kwargs)

ckpt_id = "black-forest-labs/FLUX.1-dev"
text_encoder_2 = T5EncoderModel.from_pretrained(
    ckpt_id,
    subfolder="text_encoder_2",
    quantization_config=text_encoder_2_quant_config,
    torch_dtype=torch_dtype,
)
transformer = AutoModel.from_pretrained(
    ckpt_id,
    subfolder="transformer",
    quantization_config=dit_quant_config,
    torch_dtype=torch_dtype,
)
pipe = FluxPipeline.from_pretrained(
    ckpt_id,
    transformer=transformer,
    text_encoder_2=text_encoder_2,
    torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()
pipe.transformer.compile()

image = pipeline(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=3.5,
    height=768,
    width=1360,
    num_inference_steps=28,
    max_sequence_length=512,
).images[0]

Starting from bitsandbytes==0.46.0 onwards, bnb-quantized models should be fully compatible with torch.compile() without graph-breaks. This means that when compiling a bnb-quantized model, users can do: model.compile(fullgraph=True). This can significantly improve speed while still providing memory benefits. The figure below provides a comparison with Flux.1-Dev. Refer to this benchmarking script to learn more.

Note that for 4bit bnb models, it’s currently needed to install PyTorch nightly if fullgraph=True is specified during compilation.

Huge shoutout to @anijain2305 and @StrongerXi from the PyTorch team for the incredible support.

PipelineQuantizationConfig

Users can now provide a quantization config while initializing a pipeline:

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

pipeline_quant_config = PipelineQuantizationConfig(
     quant_backend="bitsandbytes_4bit",
     quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
     components_to_quantize=["transformer", "text_encoder_2"],
)
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]

This reduces the barrier to entry for our users willing to use quantization without having to write too much code. Refer to the documentation to learn more about different configurations allowed through PipelineQuantizationConfig.

Group offloading with disk

In the previous release, we shipped “group offloading” which lets you offload blocks/nodes within a model, optimizing its memory consumption. It also lets you overlap this offloading with computation, providing a good speed-memory trade-off, especially in low VRAM environments.

However, you still need a considerable amount of system RAM to make offloading work effectively. So, low VRAM and low RAM environments would still not work.

Starting this release, users will additionally have the option to offload to disk instead of RAM, further lowering memory consumption. Set the offload_to_disk_path to enable this feature.

pipeline.transformer.enable_group_offload(
    onload_device="cuda", 
    offload_device="cpu", 
    offload_type="leaf_level", 
    offload_to_disk_path="path/to/disk"
)

Refer to these two tables to compare the speed and memory trade-offs.

LoRA metadata parsing

It is beneficial to include the LoraConfig in a LoRA state dict that was used to train the LoRA. In its absence, users were restricted to using the same LoRA alpha as the LoRA rank. We have modified the most popular training scripts to allow passing custom lora_alpha through the CLI. Refer to this thread for more updates. Refer to this comment for some extended clarifications.

New training scripts

We now have a capable training script for training robust timestep-distilled models through the SANA Sprint framework. Check out this resource for more details. Thanks to @scxue and @lawrence-cj for contributing it in this PR.
HiDream LoRA DreamBooth training script (docs). The script supports training with quantization. HiDream is an MIT-licensed model. So, make it yours with this training script.

Updates on educational materials on quantization

We have worked on a two-part series discussing the support of quantization in Diffusers. Check them out:

All commits

[LoRA] support musubi wan loras. by @sayakpaul in #11243
fix test_vanilla_funetuning failure on XPU and A100 by @yao-matrix in #11263
make test_stable_diffusion_inpaint_fp16 pass on XPU by @yao-matrix in #11264
make test_dict_tuple_outputs_equivalent pass on XPU by @yao-matrix in #11265
add onnxruntime-qnn & onnxruntime-cann by @xieofxie in #11269
make test_instant_style_multiple_masks pass on XPU by @yao-matrix in #11266
[BUG] Fix convert_vae_pt_to_diffusers bug by @lavinal712 in #11078
Fix LTX 0.9.5 single file by @hlky in #11271
[Tests] Cleanup lora tests utils by @sayakpaul in #11276
[CI] relax tolerance for unclip further by @sayakpaul in #11268
do not use DIFFUSERS_REQUEST_TIMEOUT for notification bot by @sayakpaul in #11273
Fix incorrect tile_latent_min_width calculation in AutoencoderKLMochi by @kuantuna in #11294
HiDream Image by @hlky in #11231
flow matching lcm scheduler by @quickjkee in #11170
Update autoencoderkl_allegro.md by @Forbu in #11303
Hidream refactoring follow ups by @a-r-r-o-w in #11299
Fix incorrect tile_latent_min_width calculations by @kuantuna in #11305
[ControlNet] Adds controlnet for SanaTransformer by @ishan-modi in #11040
make KandinskyV22PipelineInpaintCombinedFastTests::test_float16_inference pass on XPU by @yao-matrix in #11308
make test_stable_diffusion_karras_sigmas pass on XPU by @yao-matrix in #11310
make KolorsPipelineFastTests::test_inference_batch_single_identical pass on XPU by @faaany in #11313
[LoRA] support more SDXL loras. by @sayakpaul in #11292
[HiDream] code example by @linoytsaban in #11317
import for FlowMatchLCMScheduler by @asomoza in #11318
Use float32 on mps or npu in transformer_hidream_image's rope by @hlky in #11316
Add skrample section to community_projects.md by @Beinsezii in #11319
[docs] Promote AutoModel usage by @sayakpaul in #11300
[LoRA] Add LoRA support to AuraFlow by @hameerabbasi in #10216
Fix vae.Decoder prev_output_channel by @hlky in #11280
fix CPU offloading related fail cases on XPU by @yao-matrix in #11288
[docs] fix hidream docstrings. by @sayakpaul in #11325
Rewrite AuraFlowPatchEmbed.pe_selection_index_based_on_dim to be torch.compile compatible by @AstraliteHeart in #11297
post release 0.33.0 by @sayakpaul in #11255
another fix for FlowMatchLCMScheduler forgotten import by @asomoza in #11330

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@yao-matrix
- fix test_vanilla_funetuning failure on XPU and A100 (#11263)
- make test_stable_diffusion_inpaint_fp16 pass on XPU (#11264)
- make test_dict_tuple_outputs_equivalent pass on XPU (#11265)
- make test_instant_style_multiple_masks pass on XPU (#11266)
- make KandinskyV22PipelineInpaintCombinedFastTests::test_float16_inference pass on XPU (#11308)
- make test_stable_diffusion_karras_sigmas pass on XPU (#11310)
- fix CPU offloading related fail cases on XPU (#11288)
- enable 2 test cases on XPU (#11332)
- enable group_offload cases and quanto cases on XPU (#11405)
- enable test_layerwise_casting_memory cases on XPU (#11406)
- enable 28 GGUF test cases on XPU (#11404)
- enable marigold_intrinsics cases on XPU (#11445)
- enable consistency test cases on XPU, all passed (#11446)
- enable unidiffuser test cases on xpu (#11444)
- make safe diffusion test cases pass on XPU and A100 (#11458)
- make autoencoders. controlnet_flux and wan_transformer3d_single_file pass on xpu (#11461)
- enable semantic diffusion and stable diffusion panorama cases on XPU (#11459)
- enable lora cases on XPU (#11506)
- enable 7 cases on XPU (#11503)
- enable dit integration cases on xpu (#11523)
- enable print_env on xpu (#11507)
- enable several pipeline integration tests on XPU (#11526)
- enhance value guard of _device_agnostic_dispatch (#11553)
- enable pipeline test cases on xpu (#11527)
- enable group_offloading and PipelineDeviceAndDtypeStabilityTests on XPU, all passed (#11620)
- enable torchao test cases on XPU and switch to device agnostic APIs for test cases (#11654)
- enable cpu offloading of new pipelines on XPU & use device agnostic empty to make pipelines work on XPU (#11671)
@hlky
- Fix LTX 0.9.5 single file (#11271)
- HiDream Image (#11231)
- Use float32 on mps or npu in transformer_hidream_image's rope (#11316)
- Fix vae.Decoder prev_output_channel (#11280)
@quickjkee
- flow matching lcm scheduler (#11170)
@ishan-modi
- [ControlNet] Adds controlnet for SanaTransformer (#11040)
- [BUG] fixed _toctree.yml alphabetical ordering (#11277)
- [BUG] fixes in kadinsky pipeline (#11080)
- [Refactor] Minor Improvement for import utils (#11161)
- [Feature] Added Xlab Controlnet support (#11249)
- [BUG] fixed WAN docstring (#11226)

📹 New video generation pipelines

Wan VACE

Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Boundary Box, etc.). Recommended library for preprocessing videos to obtain control videos: huggingface/controlnet_aux
Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
Inpainting and Outpainting
Subject to Video (faces, object, characters, etc.)
Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)

The code snippets available in this pull request demonstrate some examples of how videos can be generated with controllability signals.

Check out the docs to learn more.

Cosmos Predict2 Video2World

The Video2World model comes in a 2B and 14B variant. Check out the docs to learn more.

LTX 0.9.7 and Distilled

LTX 0.9.7 and its distilled variants are the latest in the family of models released by Lightricks.

Check out the docs to learn more.

Hunyuan Video Framepack and F1

Framepack is a novel method for enabling long video generation. There are two released variants of Hunyuan Video trained using this technique. Check out the docs to learn more.

FusionX

The FusionX family of models and LoRAs, built on top of Wan2.1-14B, should already be supported. To load the model, use from_single_file():

from diffusers import WanTransformer3DModel

transformer = WanTransformer3DModel.from_single_file(
    "https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/Wan14Bi2vFusioniX_fp16.safetensors",
    torch_dtype=torch.bfloat16
)

To load the LoRAs, use load_lora_weights():

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
    "vrgamedevgirl84/Wan14BT2VFusioniX", weight_name="FusionX_LoRa/Wan2.1_T2V_14B_FusionX_LoRA.safetensors"
)

AccVideo and CausVid (only LoRAs)

🌠 New image generation pipelines

Cosmos Predict2 Text2Image

Text-to-image models from the Cosmos-Predict2 release. The models comes in a 2B and 14B variant. Check out the docs to learn more.

Chroma

Thanks to @Ednaordinary for contributing it in this PR!

VisualCloze

Support for various in-domain tasks
Generalization to unseen tasks through in-context learning
Unify multiple tasks into one step and generate both target image and intermediate results
Support reverse-engineering conditions from target images

Check out the docs to learn more. Thanks to @lzyhha for contributing this in this PR!

Better `torch.compile` support

https://github.com/huggingface/diffusers/pull/11085
https://github.com/huggingface/diffusers/issues/11430

Additionally, users can combine offloading with compilation to get a better speed-memory trade-off. Below is an example:

Code

import torch
from diffusers import DiffusionPipeline
torch._dynamo.config.cache_size_limit = 10000

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipline.enable_model_cpu_offload()
# Compile.
pipeline.transformer.compile()

image = pipeline(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=0.,
    height=768,
    width=1360,
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

This is compatible with group offloading, too. Interested readers can check out the concerned PRs below:

https://github.com/huggingface/diffusers/pull/11605
https://github.com/huggingface/diffusers/pull/11670

You can substantially reduce memory requirements by combining quantization with offloading and then improving speed with torch.compile(). Below is an example:

Code

from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel

import torch
torch._dynamo.config.recompile_limit = 1000 

quant_kwargs = {"load_in_4bit": True, "bnb_4bit_compute_dtype": torch_dtype, "bnb_4bit_quant_type": "nf4"}
text_encoder_2_quant_config = TransformersBitsAndBytesConfig(**quant_kwargs)
dit_quant_config = DiffusersBitsAndBytesConfig(**quant_kwargs)

ckpt_id = "black-forest-labs/FLUX.1-dev"
text_encoder_2 = T5EncoderModel.from_pretrained(
    ckpt_id,
    subfolder="text_encoder_2",
    quantization_config=text_encoder_2_quant_config,
    torch_dtype=torch_dtype,
)
transformer = AutoModel.from_pretrained(
    ckpt_id,
    subfolder="transformer",
    quantization_config=dit_quant_config,
    torch_dtype=torch_dtype,
)
pipe = FluxPipeline.from_pretrained(
    ckpt_id,
    transformer=transformer,
    text_encoder_2=text_encoder_2,
    torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()
pipe.transformer.compile()

image = pipeline(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=3.5,
    height=768,
    width=1360,
    num_inference_steps=28,
    max_sequence_length=512,
).images[0]

Note that for 4bit bnb models, it’s currently needed to install PyTorch nightly if fullgraph=True is specified during compilation.

Huge shoutout to @anijain2305 and @StrongerXi from the PyTorch team for the incredible support.

PipelineQuantizationConfig

Users can now provide a quantization config while initializing a pipeline:

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

pipeline_quant_config = PipelineQuantizationConfig(
     quant_backend="bitsandbytes_4bit",
     quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
     components_to_quantize=["transformer", "text_encoder_2"],
)
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]

Group offloading with disk

However, you still need a considerable amount of system RAM to make offloading work effectively. So, low VRAM and low RAM environments would still not work.

Starting this release, users will additionally have the option to offload to disk instead of RAM, further lowering memory consumption. Set the offload_to_disk_path to enable this feature.

pipeline.transformer.enable_group_offload(
    onload_device="cuda", 
    offload_device="cpu", 
    offload_type="leaf_level", 
    offload_to_disk_path="path/to/disk"
)

Refer to these two tables to compare the speed and memory trade-offs.

LoRA metadata parsing

New training scripts

We now have a capable training script for training robust timestep-distilled models through the SANA Sprint framework. Check out this resource for more details. Thanks to @scxue and @lawrence-cj for contributing it in this PR.
HiDream LoRA DreamBooth training script (docs). The script supports training with quantization. HiDream is an MIT-licensed model. So, make it yours with this training script.

Updates on educational materials on quantization

We have worked on a two-part series discussing the support of quantization in Diffusers. Check them out:

All commits

[LoRA] support musubi wan loras. by @sayakpaul in #11243
fix test_vanilla_funetuning failure on XPU and A100 by @yao-matrix in #11263
make test_stable_diffusion_inpaint_fp16 pass on XPU by @yao-matrix in #11264
make test_dict_tuple_outputs_equivalent pass on XPU by @yao-matrix in #11265
add onnxruntime-qnn & onnxruntime-cann by @xieofxie in #11269
make test_instant_style_multiple_masks pass on XPU by @yao-matrix in #11266
[BUG] Fix convert_vae_pt_to_diffusers bug by @lavinal712 in #11078
Fix LTX 0.9.5 single file by @hlky in #11271
[Tests] Cleanup lora tests utils by @sayakpaul in #11276
[CI] relax tolerance for unclip further by @sayakpaul in #11268
do not use DIFFUSERS_REQUEST_TIMEOUT for notification bot by @sayakpaul in #11273
Fix incorrect tile_latent_min_width calculation in AutoencoderKLMochi by @kuantuna in #11294
HiDream Image by @hlky in #11231
flow matching lcm scheduler by @quickjkee in #11170
Update autoencoderkl_allegro.md by @Forbu in #11303
Hidream refactoring follow ups by @a-r-r-o-w in #11299
Fix incorrect tile_latent_min_width calculations by @kuantuna in #11305
[ControlNet] Adds controlnet for SanaTransformer by @ishan-modi in #11040
make KandinskyV22PipelineInpaintCombinedFastTests::test_float16_inference pass on XPU by @yao-matrix in #11308
make test_stable_diffusion_karras_sigmas pass on XPU by @yao-matrix in #11310
make KolorsPipelineFastTests::test_inference_batch_single_identical pass on XPU by @faaany in #11313
[LoRA] support more SDXL loras. by @sayakpaul in #11292
[HiDream] code example by @linoytsaban in #11317
import for FlowMatchLCMScheduler by @asomoza in #11318
Use float32 on mps or npu in transformer_hidream_image's rope by @hlky in #11316
Add skrample section to community_projects.md by @Beinsezii in #11319
[docs] Promote AutoModel usage by @sayakpaul in #11300
[LoRA] Add LoRA support to AuraFlow by @hameerabbasi in #10216
Fix vae.Decoder prev_output_channel by @hlky in #11280
fix CPU offloading related fail cases on XPU by @yao-matrix in #11288
[docs] fix hidream docstrings. by @sayakpaul in #11325
Rewrite AuraFlowPatchEmbed.pe_selection_index_based_on_dim to be torch.compile compatible by @AstraliteHeart in #11297
post release 0.33.0 by @sayakpaul in #11255
another fix for FlowMatchLCMScheduler forgotten import by @asomoza in #11330

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@yao-matrix
- fix test_vanilla_funetuning failure on XPU and A100 (#11263)
- make test_stable_diffusion_inpaint_fp16 pass on XPU (#11264)
- make test_dict_tuple_outputs_equivalent pass on XPU (#11265)
- make test_instant_style_multiple_masks pass on XPU (#11266)
- make KandinskyV22PipelineInpaintCombinedFastTests::test_float16_inference pass on XPU (#11308)
- make test_stable_diffusion_karras_sigmas pass on XPU (#11310)
- fix CPU offloading related fail cases on XPU (#11288)
- enable 2 test cases on XPU (#11332)
- enable group_offload cases and quanto cases on XPU (#11405)
- enable test_layerwise_casting_memory cases on XPU (#11406)
- enable 28 GGUF test cases on XPU (#11404)
- enable marigold_intrinsics cases on XPU (#11445)
- enable consistency test cases on XPU, all passed (#11446)
- enable unidiffuser test cases on xpu (#11444)
- make safe diffusion test cases pass on XPU and A100 (#11458)
- make autoencoders. controlnet_flux and wan_transformer3d_single_file pass on xpu (#11461)
- enable semantic diffusion and stable diffusion panorama cases on XPU (#11459)
- enable lora cases on XPU (#11506)
- enable 7 cases on XPU (#11503)
- enable dit integration cases on xpu (#11523)
- enable print_env on xpu (#11507)
- enable several pipeline integration tests on XPU (#11526)
- enhance value guard of _device_agnostic_dispatch (#11553)
- enable pipeline test cases on xpu (#11527)
- enable group_offloading and PipelineDeviceAndDtypeStabilityTests on XPU, all passed (#11620)
- enable torchao test cases on XPU and switch to device agnostic APIs for test cases (#11654)
- enable cpu offloading of new pipelines on XPU & use device agnostic empty to make pipelines work on XPU (#11671)
@hlky
- Fix LTX 0.9.5 single file (#11271)
- HiDream Image (#11231)
- Use float32 on mps or npu in transformer_hidream_image's rope (#11316)
- Fix vae.Decoder prev_output_channel (#11280)
@quickjkee
- flow matching lcm scheduler (#11170)
@ishan-modi
- [ControlNet] Adds controlnet for SanaTransformer (#11040)
- [BUG] fixed _toctree.yml alphabetical ordering (#11277)
- [BUG] fixes in kadinsky pipeline (#11080)
- [Refactor] Minor Improvement for import utils (#11161)
- [Feature] Added Xlab Controlnet support (#11249)
- [BUG] fixed WAN docstring (#11226)

📹 New video generation pipelines

Wan VACE

Cosmos Predict2 Video2World

LTX 0.9.7 and Distilled

Hunyuan Video Framepack and F1

FusionX

AccVideo and CausVid (only LoRAs)

🌠 New image generation pipelines

Cosmos Predict2 Text2Image

Chroma

VisualCloze

Better torch.compile support

PipelineQuantizationConfig

Group offloading with disk

LoRA metadata parsing

New training scripts

Updates on educational materials on quantization

All commits

Significant community contributions

More Python Projects

AutoGPT

stable-diffusion-webui

transformers

yt-dlp

More Python Projects

AutoGPT

stable-diffusion-webui

transformers

yt-dlp

📹 New video generation pipelines

Wan VACE

Cosmos Predict2 Video2World

LTX 0.9.7 and Distilled

Hunyuan Video Framepack and F1

FusionX

AccVideo and CausVid (only LoRAs)

🌠 New image generation pipelines

Cosmos Predict2 Text2Image

Chroma

VisualCloze

Better torch.compile support

PipelineQuantizationConfig

Group offloading with disk

LoRA metadata parsing

New training scripts

Updates on educational materials on quantization

All commits

Significant community contributions

Better `torch.compile` support

Better `torch.compile` support