Diffusers 0.33.0: New Image and Video Models, Memory Optimizations, Caching Methods, Remote VAEs, New Training Scripts, and more
New Pipelines for Video Generation
Wan 2.1
Wan2.1 is a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. The model release includes four model variants and three pipelines: Text to Video, Image to Video, and Video to Video.
LTX Video 0.9.5
LTX Video 0.9.5 is the updated version of the super-fast LTX Video model series. The latest model introduces additional conditioning options, such as keyframe-based animation and video extension (both forward and backward).
To support these additional conditioning inputs, we've introduced the LTXConditionPipeline and LTXVideoCondition object.
To learn more about the usage, check out the docs here.
Hunyuan Image to Video
Hunyuan utilizes a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only architecture as the text encoder. The input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data and seamlessly integrating information from both the image and its associated caption.
ConsisID (thanks to @SHYuanBest for contributing this in this PR)
New Pipelines for Image Generation
Sana-Sprint
SANA-Sprint is an efficient diffusion model for ultra-fast text-to-image generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4, rivaling the quality of models like Flux.
Shoutout to @lawrence-cj for their help and guidance on this PR.
Check out the pipeline docs of SANA-Sprint to learn more.
Lumina2
Lumina-Image-2.0 is a 2B parameter flow-based diffusion transformer for text-to-image generation released under the Apache 2.0 license.
Check out the docs to learn more. Thanks to @zhuole1025 for contributing this through this PR.
One can also LoRA fine-tune Lumina2, taking advantage of its Apache 2.0 licensing. Check out the guide for more details.
Omnigen
OmniGen is a unified image generation model that can handle multiple tasks, including text-to-image, image editing, subject-driven generation, and various computer vision tasks, within a single framework. The model consists of a VAE and a single transformer based on Phi-3 that handles text and image encoding as well as the diffusion process.
Check out the docs to learn more about OmniGen. Thanks to @staoxiao for contributing OmniGen in this PR.
Others
CogView4 (thanks to @zRzRzRzRzRzRzR for contributing CogView4 in this PR)
New Memory Optimizations
Layerwise Casting
PyTorch supports torch.float8_e4m3fn and torch.float8_e5m2 as weight storage dtypes, but they can't be used for computation on many devices due to unimplemented kernel support.
However, you can still use these dtypes to store model weights in FP8 precision and upcast them on-the-fly to a widely supported dtype such as torch.float16 or torch.bfloat16 when the layers are used in the forward pass. This is known as layerwise weight-casting. Compared with storing the weights in a 16-bit dtype, it can cut the VRAM needed for the model weights by roughly 50%.
Code
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
model_id = "THUDM/CogVideoX-5b"
# Load the model in bfloat16 and enable layerwise casting
transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)
# Load the pipeline
pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
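To see where the roughly 50% figure comes from, here is a back-of-the-envelope sketch. It assumes a round, purely illustrative count of 5 billion parameters and counts weight storage only; activations and any layers that are kept in higher precision are ignored. The helper name weight_memory_gib is made up for this illustration and is not part of Diffusers.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {
    "float32": 4,
    "bfloat16": 2,
    "float16": 2,
    "float8_e4m3fn": 1,
    "float8_e5m2": 1,
}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed to store `num_params` weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 5_000_000_000  # assumed, illustrative parameter count
bf16 = weight_memory_gib(num_params, "bfloat16")
fp8 = weight_memory_gib(num_params, "float8_e4m3fn")

print(f"bfloat16 storage: {bf16:.2f} GiB")
print(f"float8 storage:   {fp8:.2f} GiB")
print(f"saving:           {1 - fp8 / bf16:.0%}")
```

The saving applies to stored weights only, so the end-to-end VRAM reduction measured on a real pipeline will usually be smaller.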
Group Offloading
Group offloading is the middle ground between sequential and model offloading. It works by offloading groups of internal layers (either torch.nn.ModuleList or torch.nn.Sequential), which uses less memory than model-level offloading. It is also faster than sequential-level offloading because the number of device synchronizations is reduced.
On CUDA devices, you can also enable layer prefetching with CUDA streams. The next layer to be executed is loaded onto the accelerator device while the current layer is still executing, which makes inference substantially faster while keeping VRAM requirements very low. In other words, data transfer is overlapped with computation.
Note that using CUDA streams can cause a considerable spike in CPU RAM usage. If you set use_stream=True, make sure the available CPU RAM is at least twice the size of the model. You can reduce CPU RAM usage by setting low_cpu_mem_usage=True. This should limit the CPU RAM used to roughly the size of the model, but it introduces slight latency in the inference process.
You can also use record_stream=True when using use_stream=True to obtain more speedups at the expense of slightly increased memory usage.
Code
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
# We can utilize the enable_group_offload method for Diffusers model implementations
pipe.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
)
prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
# This utilized about 14.79 GB. It can be further reduced by using tiling and using leaf_level offloading throughout the pipeline.
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
export_to_video(video, "output.mp4", fps=8)
Group offloading can also be applied to non-Diffusers models such as text encoders from the transformers library.
Code
import torch
from diffusers import CogVideoXPipeline
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video
# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
# For any other model implementations, the apply_group_offloading function can be used
apply_group_offloading(
    pipe.text_encoder,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=2,
)
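The trade-off between group size, prefetching, and accelerator memory can be illustrated with a toy model. This is only a sketch under simplifying assumptions: every block has a known size, activations are ignored, and prefetching is modeled as the next group being resident alongside the current one. The function name peak_resident_size is hypothetical and does not mirror how Diffusers implements group offloading.

```python
from typing import List


def peak_resident_size(
    layer_sizes: List[float],
    num_blocks_per_group: int,
    prefetch: bool = False,
) -> float:
    """Peak total size of blocks resident on the accelerator at once (toy model)."""
    groups = [
        layer_sizes[i : i + num_blocks_per_group]
        for i in range(0, len(layer_sizes), num_blocks_per_group)
    ]
    peak = 0.0
    for index, group in enumerate(groups):
        resident = sum(group)  # the group that is currently executing
        if prefetch and index + 1 < len(groups):
            resident += sum(groups[index + 1])  # next group is loaded in parallel
        peak = max(peak, resident)
    return peak


layers = [1.0] * 8  # eight equally sized blocks, arbitrary units

whole_model = peak_resident_size(layers, num_blocks_per_group=8)
grouped = peak_resident_size(layers, num_blocks_per_group=2)
grouped_prefetch = peak_resident_size(layers, num_blocks_per_group=2, prefetch=True)

print(whole_model, grouped, grouped_prefetch)
```

In this toy setting, smaller groups lower the peak, and prefetching raises it by one extra group in exchange for hiding transfer time behind computation.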
Remote Components
Remote components are an experimental feature designed to offload memory-intensive steps of the inference pipeline to remote endpoints. The initial implementation focuses primarily on VAE decoding operations. Below are the currently supported model endpoints:
Cached Inference for Diffusion Transformer models is a performance optimization that significantly accelerates the denoising process by caching intermediate values. This technique reduces redundant computations across timesteps, resulting in faster generation with a slight dip in output quality.
Check out the docs to learn more about the available caching methods.
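The general idea behind caching across timesteps can be sketched in a few lines. The class below is a conceptual toy, not one of the caching methods shipped in Diffusers: CachedBlock and skip_range are hypothetical names, and the "expensive block" is just a squaring function. The block is recomputed only every skip_range steps and the stored result is reused in between.

```python
from typing import Callable, Optional


class CachedBlock:
    """Wraps an expensive function and reuses its last output between refreshes."""

    def __init__(self, fn: Callable[[float], float], skip_range: int) -> None:
        self.fn = fn
        self.skip_range = skip_range
        self.calls = 0  # how many times the expensive function actually ran
        self._cache: Optional[float] = None

    def __call__(self, x: float, step: int) -> float:
        # Recompute on the first call and on every `skip_range`-th step.
        if self._cache is None or step % self.skip_range == 0:
            self._cache = self.fn(x)
            self.calls += 1
        return self._cache


block = CachedBlock(fn=lambda x: x * x, skip_range=2)
outputs = [block(float(step), step) for step in range(10)]

print("expensive calls:", block.calls, "out of", len(outputs))
print(outputs)
```

Half of the calls are skipped here, and the reused values differ from what a fresh computation would return on the skipped steps, which is the toy counterpart of the speed versus quality trade-off described above.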
Improved loading for uintx TorchAO checkpoints with torch>=2.6
TorchAO checkpoints currently have to be serialized using pickle. For some quantization dtypes using the uintx format, such as uint4wo, this involves saving subclassed TorchAO Tensor objects in the model file. This made loading the models directly with Diffusers a bit tricky, since we do not allow deserializing arbitrary Python objects from pickle files.
Torch 2.6 allows adding expected Tensors to torch safe globals, which lets us directly load TorchAO checkpoints with these objects.
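The allowlisting idea can be illustrated with Python's standard pickle module alone. The sketch below is an analogue, not PyTorch's implementation: AllowlistUnpickler and safe_loads are hypothetical names, and an OrderedDict stands in for a tensor subclass. Only classes that were explicitly allowlisted can be reconstructed during loading.

```python
import io
import pickle
from collections import OrderedDict
from typing import Iterable, Tuple


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that only resolves (module, name) pairs on an allowlist."""

    def __init__(self, file, allowed: Iterable[Tuple[str, str]]) -> None:
        super().__init__(file)
        self._allowed = set(allowed)

    def find_class(self, module: str, name: str):
        if (module, name) in self._allowed:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"{module}.{name} is not allowlisted")


def safe_loads(data: bytes, allowed: Iterable[Tuple[str, str]] = ()):
    """Load pickled bytes, refusing any class that is not in `allowed`."""
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()


payload = pickle.dumps(OrderedDict(weight=[1, 2, 3]))

# Refused by default: OrderedDict has not been allowlisted.
try:
    safe_loads(payload)
    refused = False
except pickle.UnpicklingError:
    refused = True

# Accepted once the expected class is added to the allowlist.
loaded = safe_loads(payload, allowed=[("collections", "OrderedDict")])
print(refused, loaded["weight"])
```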
We have shipped a couple of improvements on the LoRA front in this release.
🚨 Improved coverage for loading non-diffusers LoRA checkpoints for Flux
Take note of the breaking change introduced in this PR 🚨 We suggest you upgrade your peft installation to the latest version (pip install -U peft), especially when dealing with Flux LoRAs.
torch.compile() support when hotswapping LoRAs without triggering recompilation
A common use case when serving multiple adapters is to load one adapter first, generate images, load another adapter, generate more images, load another adapter, etc. This workflow normally requires calling load_lora_weights(), set_adapters(), and possibly delete_adapters() to save memory. Moreover, if the model is compiled using torch.compile, performing these steps requires recompilation, which takes time.
To better support this common workflow, you can "hotswap" a LoRA adapter to avoid accumulating memory and, in some cases, recompilation. It requires an adapter to already be loaded, and the new adapter weights are swapped in-place for the existing adapter.
Check out the docs to learn more about this feature.
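The difference between a regular load and an in-place swap can be shown with a small toy. This is a sketch only: ToyLoRAHost is a hypothetical stand-in for a model with LoRA layers, plain Python lists stand in for weight tensors, and the requirement that shapes match is a simplification made for this illustration.

```python
from typing import Dict, List


class ToyLoRAHost:
    """Holds adapter weights by name; a stand-in for a model with LoRA layers."""

    def __init__(self) -> None:
        self.adapters: Dict[str, List[float]] = {}

    def load_adapter(self, name: str, weights: List[float], hotswap: bool = False) -> None:
        if not hotswap:
            # Regular load: allocates a new buffer for this adapter.
            self.adapters[name] = list(weights)
            return
        if name not in self.adapters:
            raise ValueError("hotswap needs an adapter that is already loaded")
        buffer = self.adapters[name]
        if len(buffer) != len(weights):
            raise ValueError("this toy hotswap requires matching shapes")
        buffer[:] = weights  # overwrite in place; the buffer object is reused


host = ToyLoRAHost()
host.load_adapter("default_0", [0.1, 0.2, 0.3])
buffer_before = host.adapters["default_0"]

host.load_adapter("default_0", [0.7, 0.8, 0.9], hotswap=True)
same_buffer = host.adapters["default_0"] is buffer_before

print(len(host.adapters), same_buffer, host.adapters["default_0"])
```

Because the existing buffer is reused rather than replaced, the number of loaded adapters stays at one, which is the property that lets memory stay flat across swaps.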
The other major change is support for loading LoRAs into quantized model checkpoints.
dtype Maps for Pipelines
Since various pipelines require their components to run in different compute dtypes, we now support passing a dtype map when initializing a pipeline:
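As a rough sketch of how such a map could be resolved per component, the helper below looks up a component name and falls back to a shared entry. Several things here are assumptions made for illustration: the resolve_dtype helper is hypothetical, the "default" key and the float32 fallback are not confirmed by this text, and plain strings stand in for torch dtypes so that the example runs without PyTorch.

```python
from typing import Dict


def resolve_dtype(component: str, dtype_map: Dict[str, str], fallback: str = "float32") -> str:
    """Pick the dtype for one pipeline component from a per-component map."""
    if component in dtype_map:
        return dtype_map[component]
    # Assumed convention: a "default" entry covers components not listed explicitly.
    return dtype_map.get("default", fallback)


dtype_map = {"transformer": "bfloat16", "default": "float16"}

resolved = {
    name: resolve_dtype(name, dtype_map)
    for name in ("transformer", "vae", "text_encoder")
}
print(resolved)
```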
This release includes an AutoModel object similar to the one found in transformers that automatically fetches the appropriate model class for the provided repo.
from diffusers import AutoModel
unet = AutoModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
All commits
[Sana 4K] Add vae tiling option to avoid OOM by @leisuzz in #10583
IP-Adapter for StableDiffusion3Img2ImgPipeline by @guiyrt in #10589
[DC-AE, SANA] fix SanaMultiscaleLinearAttention apply_quadratic_attention bf16 by @chenjy2003 in #10595
Move buffers to device by @hlky in #10523
[Docs] Update SD3 ip_adapter model_id to diffusers checkpoint by @guiyrt in #10597
Scheduling fixes on MPS by @hlky in #10549
[Docs] Add documentation about using ParaAttention to optimize FLUX and HunyuanVideo by @chengzeyi in #10544
NPU adaption for RMSNorm by @leisuzz in #10534
implementing flux on TPUs with ptxla by @entrpn in #10515
[core] ConsisID by @SHYuanBest in #10140
[training] set rest of the blocks with requires_grad False. by @sayakpaul in #10607
chore: remove redundant words by @sunxunle in #10609
bugfix for npu not support float64 by @baymax591 in #10123
[chore] change licensing to 2025 from 2024. by @sayakpaul in #10615
Enable dreambooth lora finetune example on other devices by @jiqing-feng in #10602
Remove the FP32 Wrapper when evaluating by @lmxyy in #10617
[tests] make tests device-agnostic (part 3) by @faaany in #10437
fix offload gpu tests etc by @yiyixuxu in #10366
Remove cache migration script by @Wauplin in #10619
[core] Layerwise Upcasting by @a-r-r-o-w in #10347
Improve TorchAO error message by @a-r-r-o-w in #10627
[CI] Update HF_TOKEN in all workflows by @DN6 in #10613
add onnxruntime-migraphx as part of check for onnxruntime in import_utils.py by @kahmed10 in #10624
[Tests] modify the test slices for the failing flax test by @sayakpaul in #10630
[docs] fix image path in para attention docs by @sayakpaul in #10632
[docs] uv installation by @stevhliu in #10622
width and height are mixed-up by @raulc0399 in #10629
Add IP-Adapter example to Flux docs by @hlky in #10633
removing redundant requires_grad = False by @YanivDorGalron in #10628
[chore] add a script to extract loras from full fine-tuned models by @sayakpaul in #10631
Add pipeline_stable_diffusion_xl_attentive_eraser by @Anonym0u3 in #10579
NPU Adaption for Sanna by @leisuzz in #10409
Add sigmoid scheduler in scheduling_ddpm.py docs by @JacobHelwig in #10648
create a script to train autoencoderkl by @lavinal712 in #10605
Add community pipeline for semantic guidance for FLUX by @Marlon154 in #10610
ControlNet Union controlnet_conditioning_scale for multiple control inputs by @hlky in #10666
[training] Convert to ImageFolder script by @hlky in #10664
Add provider_options to OnnxRuntimeModel by @hlky in #10661
fix check_inputs func in LuminaText2ImgPipeline by @victolee0 in #10651
SDXL ControlNet Union pipelines, make control_image argument immutible by @Teriks in #10663
Revert RePaint scheduler 'fix' by @GiusCat in #10644
[core] Pyramid Attention Broadcast by @a-r-r-o-w in #9562
[fix] refer use_framewise_encoding on AutoencoderKLHunyuanVideo._encode by @hanchchch in #10600
Refactor gradient checkpointing by @a-r-r-o-w in #10611
[Tests] conditionally check fp8_e4m3_bf16_max_memory < fp8_e4m3_fp32_max_memory by @sayakpaul in #10669
Fix pipeline dtype unexpected change when using SDXL reference community pipelines in float16 mode by @dimitribarbot in #10670
[tests] update llamatokenizer in hunyuanvideo tests by @sayakpaul in #10681
support StableDiffusionAdapterPipeline.from_single_file by @Teriks in #10552
fix(hunyuan-video): typo in height and width input check by @badayvedat in #10684
[FIX] check_inputs function in Auraflow Pipeline by @SahilCarterr in #10678
Fix enable memory efficient attention on ROCm by @tenpercent in #10564
Fix inconsistent random transform in instruct pix2pix by @Luvata in #10698
feat(training-utils): support device and dtype params in compute_density_for_timestep_sampling by @badayvedat in #10699
Fixed grammar in "write_own_pipeline" readme by @N0-Flux-given in #10706
Fix Documentation about Image-to-Image Pipeline by @ParagEkbote in #10704
[bitsandbytes] Simplify bnb int8 dequant by @sayakpaul in #10401
Fix train_text_to_image.py --help by @nkthiebaut in #10711
Notebooks for Community Scripts-6 by @ParagEkbote in #10713
[Fix] Type Hint in from_pretrained() to Ensure Correct Type Inference by @SahilCarterr in #10714
add provider_options in from_pretrained by @xieofxie in #10719
[Community] Enhanced Model Search by @suzukimain in #10417
[bugfix] NPU Adaption for Sana by @leisuzz in #10724
Quantized Flux with IP-Adapter by @hlky in #10728
EDMEulerScheduler accept sigmas, add final_sigmas_type by @hlky in #10734
[LoRA] fix peft state dict parsing by @sayakpaul in #10532
Add Self type hint to ModelMixin's from_pretrained by @hlky in #10742
[Tests] Test layerwise casting with training by @sayakpaul in #10765
speedup hunyuan encoder causal mask generation by @dabeschte in #10764
[CI] Fix Truffle Hog failure by @DN6 in #10769
Add OmniGen by @staoxiao in #10148
feat: new community mixture_tiling_sdxl pipeline for SDXL by @elismasilva in #10759
Add support for lumina2 by @zhuole1025 in #10642
Refactor OmniGen by @a-r-r-o-w in #10771
Faster set_adapters by @Luvata in #10777
[Single File] Add Single File support for Lumina Image 2.0 Transformer by @DN6 in #10781
Fix use_lu_lambdas and use_karras_sigmas with beta_schedule=squaredcos_cap_v2 in DPMSolverMultistepScheduler by @hlky in #10740
MultiControlNetUnionModel on SDXL by @guiyrt in #10747
fix: [Community pipeline] Fix flattened elements on image by @elismasilva in #10774
make tensors contiguous before passing to safetensors by @faaany in #10761
Disable PEFT input autocast when using fp8 layerwise casting by @a-r-r-o-w in #10685
Update FlowMatch docstrings to mention correct output classes by @a-r-r-o-w in #10788
Refactor CogVideoX transformer forward by @a-r-r-o-w in #10789
Module Group Offloading by @a-r-r-o-w in #10503
Update Custom Diffusion Documentation for Multiple Concept Inference to resolve issue #10791 by @puhuk in #10792
[FIX] check_inputs function in lumina2 by @SahilCarterr in #10784
follow-up refactor on lumina2 by @yiyixuxu in #10776
CogView4 (supports different length c and uc) by @zRzRzRzRzRzRzR in #10649
typo fix by @YanivDorGalron in #10802
Extend Support for callback_on_step_end for AuraFlow and LuminaText2Img Pipelines by @ParagEkbote in #10746
[chore] update notes generation spaces by @sayakpaul in #10592
[LoRA] improve lora support for flux. by @sayakpaul in #10810
Fix max_shift value in flux and related functions to 1.15 (issue #10675) by @puhuk in #10807
[docs] add missing entries to the lora docs. by @sayakpaul in #10819
DiffusionPipeline mixin to+FromOriginalModelMixin/FromSingleFileMixin from_single_file type hint by @hlky in #10811
[LoRA] make set_adapters() robust on silent failures. by @sayakpaul in #9618
[FEAT] Model loading refactor by @SunMarc in #10604
[misc] feat: introduce a style bot. by @sayakpaul in #10274
Remove print statements by @a-r-r-o-w in #10836
[tests] use proper gemma class and config in lumina2 tests. by @sayakpaul in #10828
[LoRA] add LoRA support to Lumina2 and fine-tuning script by @sayakpaul in #10818
[Utils] add utilities for checking if certain utilities are properly documented by @sayakpaul in #7763
Add missing isinstance for arg checks in GGUFParameter by @AstraliteHeart in #10834
[tests] test encode_prompt() in isolation by @sayakpaul in #10438
store activation cls instead of function by @SunMarc in #10832
fix: support transformer models' generation_config in pipeline by @JeffersonQin in #10779
Notebooks for Community Scripts-7 by @ParagEkbote in #10846
[CI] install accelerate transformers from main by @sayakpaul in #10289
[CI] run fast gpu tests conditionally on pull requests. by @sayakpaul in #10310
SD3 IP-Adapter runtime checkpoint conversion by @guiyrt in #10718
Some consistency-related fixes for HunyuanVideo by @a-r-r-o-w in #10835
SkyReels Hunyuan T2V & I2V by @a-r-r-o-w in #10837
fix: run tests from a pr workflow. by @sayakpaul in #9696
[chore] template for remote vae. by @sayakpaul in #10849
fix remote vae template by @sayakpaul in #10852
[CI] Fix incorrectly named test module for Hunyuan DiT by @DN6 in #10854
[CI] Update always test Pipelines list in Pipeline fetcher by @DN6 in #10856
device_map in load_model_dict_into_meta by @hlky in #10851
[Fix] Docs overview.md by @SahilCarterr in #10858
remove format check for safetensors file by @SunMarc in #10864
[docs] LoRA support by @stevhliu in #10844
Comprehensive type checking for from_pretrained kwargs by @guiyrt in #10758
Fix torch_dtype in Kolors text encoder with transformers v4.49 by @hlky in #10816
[LoRA] restrict certain keys to be checked for peft config update. by @sayakpaul in #10808
Add SD3 ControlNet to AutoPipeline by @hlky in #10888
[docs] Update prompt weighting docs by @stevhliu in #10843
[docs] Flux group offload by @stevhliu in #10847
[Fix] fp16 unscaling in train_dreambooth_lora_sdxl by @SahilCarterr in #10889
[docs] Add CogVideoX Schedulers by @a-r-r-o-w in #10885
[chore] correct qk norm list. by @sayakpaul in #10876
[Docs] Fix toctree sorting by @DN6 in #10894
[refactor] SD3 docs & remove additional code by @a-r-r-o-w in #10882
[refactor] Remove additional Flux code by @a-r-r-o-w in #10881
[CI] Improvements to conditional GPU PR tests by @DN6 in #10859
Multi IP-Adapter for Flux pipelines by @guiyrt in #10867
Fix Callback Tensor Inputs of the SDXL Controlnet Inpaint and Img2img Pipelines are missing "controlnet_image". by @CyberVy in #10880
Security fix by @ydshieh in #10905
Marigold Update: v1-1 models, Intrinsic Image Decomposition pipeline, documentation by @toshas in #10884
[Tests] fix: lumina2 lora fuse_nan test by @sayakpaul in #10911
Fix Callback Tensor Inputs of the SD Controlnet Pipelines are missing some elements. by @CyberVy in #10907
[CI] Fix Fast GPU tests on PR by @DN6 in #10912
[CI] Fix for failing IP Adapter test in Fast GPU PR tests by @DN6 in #10915
Experimental per control type scale for ControlNet Union by @hlky in #10723
[style bot] improve security for the stylebot. by @sayakpaul in #10908
[CI] Update Stylebot Permissions by @DN6 in #10931
[Alibaba Wan Team] continue on #10921 Wan2.1 by @yiyixuxu in #10922
Support IPAdapter for more Flux pipelines by @hlky in #10708
Add remote_decode to remote_utils by @hlky in #10898
Update VAE Decode endpoints by @hlky in #10939
[chore] fix-copies to flux pipelines by @sayakpaul in #10941
[Tests] Remove more encode prompts tests by @sayakpaul in #10942
Add EasyAnimateV5.1 text-to-video, image-to-video, control-to-video generation model by @bubbliiiing in #10626
Fix SD2.X clip single file load projection_dim by @Teriks in #10770
add from_single_file to animatediff by @ in #10924
Add Example of IPAdapterScaleCutoffCallback to Docs by @ParagEkbote in #10934
Update pipeline_cogview4.py by @zRzRzRzRzRzRzR in #10944
Fix redundant prev_output_channel assignment in UNet2DModel by @ahmedbelgacem in #10945
Improve load_ip_adapter RAM Usage by @CyberVy in #10948
[tests] make tests device-agnostic (part 4) by @faaany in #10508
Update evaluation.md by @sayakpaul in #10938
[LoRA] feat: support non-diffusers lumina2 LoRAs. by @sayakpaul in #10909
[Quantization] support pass MappingType for TorchAoConfig by @a120092009 in #10927
Fix the missing parentheses when calling is_torchao_available in quantization_config.py. by @CyberVy in #10961
[LoRA] Support Wan by @a-r-r-o-w in #10943
Fix incorrect seed initialization when args.seed is 0 by @azolotenkov in #10964
feat: add Mixture-of-Diffusers ControlNet Tile upscaler Pipeline for SDXL by @elismasilva in #10951
[Docs] CogView4 comment fix by @zRzRzRzRzRzRzR in #10957
update check_input for cogview4 by @yiyixuxu in #10966
Add VAE Decode endpoint slow test by @hlky in #10946
[flux lora training] fix t5 training bug by @linoytsaban in #10845
use style bot GH Action from huggingface_hub by @hanouticelina in #10970
[train_dreambooth_lora.py] Fix the LR Schedulers when num_train_epochs is passed in a distributed training env by @flyxiv in #10973
[tests] fix tests for save load components by @sayakpaul in #10977
Fix loading OneTrainer Flux LoRA by @hlky in #10978
fix default values of Flux guidance_scale in docstrings by @catwell in #10982
[CI] remove synchornized. by @sayakpaul in #10980
Bump jinja2 from 3.1.5 to 3.1.6 in /examples/research_projects/realfill by @dependabot[bot] in #10984
Fix Flux Controlnet Pipeline _callback_tensor_inputs Missing Some Elements by @CyberVy in #10974
[Single File] Add user agent to SF download requests. by @DN6 in #10979
Add CogVideoX DDIM Inversion to Community Pipelines by @LittleNyima in #10956
fix wan i2v pipeline bugs by @yupeng1111 in #10975
Hunyuan I2V by @a-r-r-o-w in #10983
Fix Graph Breaks When Compiling CogView4 by @chengzeyi in #10959
Wan VAE move scaling to pipeline by @hlky in #10998
[LoRA] remove full key prefix from peft. by @sayakpaul in #11004
[Single File] Add single file support for Wan T2V/I2V by @DN6 in #10991
Add STG to community pipelines by @kinam0252 in #10960
[LoRA] Improve copied from comments in the LoRA loader classes by @sayakpaul in #10995
Fix for fetching variants only by @DN6 in #10646
[Quantization] Add Quanto backend by @DN6 in #10756
[Single File] Add single file loading for SANA Transformer by @ishan-modi in #10947
[LoRA] Improve warning messages when LoRA loading becomes a no-op by @sayakpaul in #10187
[LoRA] CogView4 by @a-r-r-o-w in #10981
[Tests] improve quantization tests by additionally measuring the inference memory savings by @sayakpaul in #11021
[Research Project] Add AnyText: Multilingual Visual Text Generation And Editing by @tolgacangoz in #8998
[Quantization] Allow loading TorchAO serialized Tensor objects with torch>=2.6 by @DN6 in #11018
fix: mixture tiling sdxl pipeline - adjust gerating time_ids & embeddings by @elismasilva in #11012
[LoRA] support wan i2v loras from the world. by @sayakpaul in #11025
Fix SD3 IPAdapter feature extractor by @hlky in #11027
chore: fix help messages in advanced diffusion examples by @wonderfan in #10923
Fix missing **kwargs in lora_pipeline.py by @CyberVy in #11011
Fix for multi-GPU WAN inference by @AmericanPresidentJimmyCarter in #10997
[Refactor] Clean up import utils boilerplate by @DN6 in #11026
Use output_size in repeat_interleave by @hlky in #11030
[hybrid inference 📯🐌] Add VAE encode by @hlky in #11017
Wan Pipeline scaling fix, type hint warning, multi generator fix by @hlky in #11007
[LoRA] change to warning from info when notifying the users about a LoRA no-op by @sayakpaul in #11044
Rename Lumina(2)Text2ImgPipeline -> Lumina(2)Pipeline by @hlky in #10827
making formatted_images initialization compact by @YanivDorGalron in #10801
Fix aclnnRepeatInterleaveIntWithDim error on NPU for get_1d_rotary_pos_embed by @ZhengKai91 in #10820
[Tests] restrict memory tests for quanto for certain schemes. by @sayakpaul in #11052
[LoRA] feat: support non-diffusers wan t2v loras. by @sayakpaul in #11059
[examples/controlnet/train_controlnet_sd3.py] Fixes #11050 - Cast prompt_embeds and pooled_prompt_embeds to weight_dtype to prevent dtype mismatch by @andjoer in #11051
reverts accidental change that removes attn_mask in attn. Improves fl… by @entrpn in #11065
Fix deterministic issue when getting pipeline dtype and device by @dimitribarbot in #10696
[Tests] add requires peft decorator. by @sayakpaul in #11037
CogView4 Control Block by @zRzRzRzRzRzRzR in #10809
[CI] pin transformers version for benchmarking. by @sayakpaul in #11067
Fix Wan I2V Quality by @chengzeyi in #11087
LTX 0.9.5 by @a-r-r-o-w in #10968
make PR GPU tests conditioned on styling. by @sayakpaul in #11099
Group offloading improvements by @a-r-r-o-w in #11094
Fix pipeline_flux_controlnet.py by @co63oc in #11095
update readme instructions. by @entrpn in #11096
Resolve stride mismatch in UNet's ResNet to support Torch DDP by @jinc7461 in #11098
Fix Group offloading behaviour when using streams by @a-r-r-o-w in #11097
Quality options in export_to_video by @hlky in #11090
[CI] uninstall deps properly from pr gpu tests. by @sayakpaul in #11102
[BUG] Fix Autoencoderkl train script by @lavinal712 in #11113
[Wan LoRAs] make T2V LoRAs compatible with Wan I2V by @linoytsaban in #11107
[tests] enable bnb tests on xpu by @faaany in #11001
[fix bug] PixArt inference_steps=1 by @lawrence-cj in #11079
Flux with Remote Encode by @hlky in #11091
[tests] make cuda only tests device-agnostic by @faaany in #11058
Provide option to reduce CPU RAM usage in Group Offload by @DN6 in #11106
remove F.rms_norm for now by @yiyixuxu in #11126
Notebooks for Community Scripts-8 by @ParagEkbote in #11128
fix _callback_tensor_inputs of sd controlnet inpaint pipeline missing some elements by @CyberVy in #11073
[core] FasterCache by @a-r-r-o-w in #10163
add sana-sprint by @yiyixuxu in #11074
Don't override torch_dtype and don't use when quantization_config is set by @hlky in #11039
Update README and example code for AnyText usage by @tolgacangoz in #11028
Modify the implementation of retrieve_timesteps in CogView4-Control. by @zRzRzRzRzRzRzR in #11125
[fix SANA-Sprint] by @lawrence-cj in #11142
New HunyuanVideo-I2V by @a-r-r-o-w in #11066
[doc] Fix Korean Controlnet Train doc by @flyxiv in #11141
Improve information about group offloading and layerwise casting by @a-r-r-o-w in #11101
add a timestep scale for sana-sprint teacher model by @lawrence-cj in #11150
[Quantization] dtype fix for GGUF + fix BnB tests by @DN6 in #11159
Set self._hf_peft_config_loaded to True when LoRA is loaded using load_lora_adapter in PeftAdapterMixin class by @kentdan3msu in #11155
WanI2V encode_image by @hlky in #11164
[Docs] Update Wan Docs with memory optimizations by @DN6 in #11089
Fix LatteTransformer3DModel dtype mismatch with enable_temporal_attentions by @hlky in #11139
Raise warning and round down if Wan num_frames is not 4k + 1 by @a-r-r-o-w in #11167
[Docs] Fix environment variables in installation.md by @remarkablemark in #11179
Add latents_mean and latents_std to SDXLLongPromptWeightingPipeline by @hlky in #11034
Bug fix in LTXImageToVideoPipeline.prepare_latents() when latents is already set by @kakukakujirori in #10918
[tests] no hard-coded cuda by @faaany in #11186
[WIP] Add Wan Video2Video by @DN6 in #11053
map BACKEND_RESET_MAX_MEMORY_ALLOCATED to reset_peak_memory_stats on XPU by @yao-matrix in #11191
fix autocast by @jiqing-feng in #11190
fix: for checking mandatory and optional pipeline components by @elismasilva in #11189
remove unnecessary call to F.pad by @bm-synth in #10620
allow models to run with a user-provided dtype map instead of a single dtype by @hlky in #10301
[tests] HunyuanDiTControlNetPipeline inference precision issue on XPU by @faaany in #11197
Revert save_model in ModelMixin save_pretrained and use safe_serialization=False in test by @hlky in #11196
[docs] torch_dtype map by @hlky in #11194
Fix enable_sequential_cpu_offload in CogView4Pipeline by @hlky in #11195
SchedulerMixin from_pretrained and ConfigMixin Self type annotation by @hlky in #11192
Update import_utils.py by @Lakshaysharma048 in #10329
Add CacheMixin to Wan and LTX Transformers by @DN6 in #11187
feat: [Community Pipeline] - FaithDiff Stable Diffusion XL Pipeline by @elismasilva in #11188
[Model Card] standardize advanced diffusion training sdxl lora by @chiral-carbon in #7615
Change KolorsPipeline LoRA Loader to StableDiffusion by @BasileLewan in #11198
Update Style Bot workflow by @hanouticelina in #11202
Fixed requests.get function call by adding timeout parameter. by @kghamilton89 in #11156
Fix Single File loading for LTX VAE by @DN6 in #11200
[feat]Add strength in flux_fill pipeline (denoising strength for fluxfill) by @Suprhimp in #10603
[LTX0.9.5] Refactor LTXConditionPipeline for text-only conditioning by @tolgacangoz in #11174
Add Wan with STG as a community pipeline by @Ednaordinary in #11184
Add missing MochiEncoder3D.gradient_checkpointing attribute by @mjkvaak-amd in #11146
enable 1 case on XPU by @yao-matrix in #11219
ensure dtype match between diffused latents and vae weights by @heyalexchoi in #8391
[docs] MPS update by @stevhliu in #11212
Add support to pass image embeddings to the WAN I2V pipeline. by @goiri in #11175
[train_controlnet.py] Fix the LR schedulers when num_train_epochs is passed in a distributed training env by @Bhavay-2001 in #8461
[Training] Better image interpolation in training scripts by @asomoza in #11206
[LoRA] Implement hot-swapping of LoRA by @BenjaminBossan in #9453
introduce compute arch specific expectations and fix test_sd3_img2img_inference failure by @yao-matrix in #11227
[Flux LoRA] fix issues in flux lora scripts by @linoytsaban in #11111
Flux quantized with lora by @hlky in #10990
[feat] implement record_stream when using CUDA streams during group offloading by @sayakpaul in #11081
[bistandbytes] improve replacement warnings for bnb by @sayakpaul in #11132
minor update to sana sprint docs. by @sayakpaul in #11236
[docs] minor updates to dtype map docs. by @sayakpaul in #11237
[LoRA] support more comyui loras for Flux 🚨 by @sayakpaul in #10985
fix: SD3 ControlNet validation so that it runs on a A100. by @sayakpaul in #11238
AudioLDM2 Fixes by @hlky in #11244
AutoModel by @hlky in #11115
fix FluxReduxSlowTests::test_flux_redux_inference case failure on XPU by @yao-matrix in #11245
[docs] AutoModel by @hlky in #11250
Update Ruff to latest Version by @DN6 in #10919
fix flux controlnet bug by @free001style in #11152
fix timeout constant by @sayakpaul in #11252
fix consisid imports by @sayakpaul in #11254
Release: v0.33.0 by @sayakpaul (direct commit on v0.33.0-release)
Significant community contributions
The following contributors have made significant changes to the library over the last release:
@guiyrt
IP-Adapter for StableDiffusion3Img2ImgPipeline (#10589)
[Docs] Update SD3 ip_adapter model_id to diffusers checkpoint (#10597)