Zstandard v1.5.4
Zstandard v1.5.4 is a sizable release, the product of one year of work spread over more than 650 commits. It offers significant performance improvements across multiple scenarios, as well as new features (detailed below). It also includes a number of small bug fixes, a few of which, targeting 32-bit mode, are important enough to make this release a recommended upgrade.
Various Speed improvements
This release accumulates a number of scenario-specific improvements that, taken together, benefit a good portion of the installed base in one way or another.
Among the easier ones to describe, the repository has received several contributions for Arm optimizations, notably from @JunHe77 and @danlark1, and @terrelln has improved decompression speed for non-x64 systems, including Arm. The combined effect of this work is visible in the following example, measured on an M1 Pro (aarch64 architecture) and a Galaxy S22:
| cpu | function | corpus | v1.5.2 | v1.5.4 | Improvement |
| --- | --- | --- | --- | --- | --- |
| M1 Pro | decompress | silesia.tar | 1370 MB/s | 1480 MB/s | + 8% |
| Galaxy S22 | decompress | silesia.tar | 1150 MB/s | 1200 MB/s | + 4% |
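The Improvement column in these tables reads as the relative throughput gain of v1.5.4 over v1.5.2, rounded. That interpretation is an assumption, and the helper name `improvement_pct` below is hypothetical; it is only a small sketch for reproducing the arithmetic on your own measurements.

```c
#include <assert.h>

/* Relative throughput gain, in percent, of a new measurement over a baseline.
 * Hypothetical helper: it only mirrors how the Improvement column appears
 * to be derived from the two speed columns. */
static double improvement_pct(double baseline_mbps, double new_mbps)
{
    assert(baseline_mbps > 0.0);  /* a zero baseline has no meaningful ratio */
    return (new_mbps - baseline_mbps) / baseline_mbps * 100.0;
}
```

For instance, `improvement_pct(1370.0, 1480.0)` evaluates to roughly 8.03, in line with the "+ 8%" reported for the M1 Pro row.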
Middle compression levels (5-12) receive some care too, with @terrelln improving the dispatch engine and @danlark1 offering NEON optimizations. The exact speedup varies with platform, CPU, compiler, and compression level, though one can expect gains ranging from +1% to +10% depending on the scenario.
| cpu | function | corpus | v1.5.2 | v1.5.4 | Improvement |
| --- | --- | --- | ---:| ---:| --- |
| i7-9700k | compress -6 | silesia.tar | 110 MB/s | 121 MB/s | +10% |
| Galaxy S22 | compress -6 | silesia.tar | 98 MB/s | 103 MB/s | +5% |
| M1 Pro | compress -6 | silesia.tar | 122 MB/s | 130 MB/s | +6.5% |
| i7-9700k | compress -9 | silesia.tar | 64 MB/s | 70 MB/s | +9.5% |
| Galaxy S22 | compress -9 | silesia.tar | 51 MB/s | 52 MB/s | +1% |
| M1 Pro | compress -9 | silesia.tar | 77 MB/s | 86 MB/s | +11.5% |
| i7-9700k | compress -12 | silesia.tar | 31.6 MB/s | 31.8 MB/s | +0.5% |
| Galaxy S22 | compress -12 | silesia.tar | 20.9 MB/s | 22.1 MB/s | +5% |
| M1 Pro | compress -12 | silesia.tar | 36.1 MB/s | 39.7 MB/s | +10% |
The speed of the streaming compression interface has been improved by @embg in scenarios involving large files (where the size is a multiple of the windowSize parameter). The improvement is mostly perceptible at the fastest settings (i.e. around level 1). In the following sample, the measurement is taken directly at the ZSTD_compressStream() function call, using the dedicated benchmark tool tests/fullbench.
| cpu | function | corpus | v1.5.2 | v1.5.4 | Improvement |
| --- | --- | --- | --- | --- | --- |
| i7-9700k | ZSTD_compressStream() -1 | silesia.tar | 392 MB/s | 429 MB/s | +9.5% |
| Galaxy S22 | ZSTD_compressStream() -1 | silesia.tar | 380 MB/s | 430 MB/s | +13% |
| M1 Pro | ZSTD_compressStream() -1 | silesia.tar | 476 MB/s | 539 MB/s | +13% |
Finally, dictionary compression speed has received a good boost from @embg. The exact outcome varies depending on system and corpus. The following result is achieved by cutting the enwik8 compression corpus into 1KB blocks, generating a dictionary from these blocks, and then benchmarking the compression speed at level 1.
| cpu | function | corpus | v1.5.2 | v1.5.4 | Improvement |
| --- | --- | --- | --- | --- | --- |
| i7-9700k | dictionary compress | enwik8 -B1K | 125 MB/s | 165 MB/s | +32% |
| Galaxy S22 | dictionary compress | enwik8 -B1K | 138 MB/s | 166 MB/s | +20% |
| M1 Pro | dictionary compress | enwik8 -B1K | 155 MB/s | 195 MB/s | +25% |
There are a few more scenario-specific improvements listed in the changelog section below.
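The block-cutting step of this methodology can be sketched as follows. This is an illustration under assumptions, not the benchmark's actual code: the helper names `count_blocks` and `split_into_blocks` are hypothetical, and the sketch only computes the per-block sizes. An array of sample sizes like this one is the kind of input that a dictionary trainer such as `ZDICT_trainFromBuffer()` (declared in `zdict.h`) takes alongside the concatenated samples.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 1024  /* 1 KB, matching the -B1K setting in the table */

/* Number of blocks needed to cover srcSize bytes (the last one may be short). */
static size_t count_blocks(size_t srcSize)
{
    return (srcSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Fill sizes[] with the length of each consecutive block of the input.
 * Returns the number of entries written. Hypothetical helper for illustration. */
static size_t split_into_blocks(size_t srcSize, size_t *sizes, size_t capacity)
{
    size_t const nbBlocks = count_blocks(srcSize);
    size_t remaining = srcSize;
    size_t i;
    assert(nbBlocks <= capacity);  /* caller must provide enough room */
    for (i = 0; i < nbBlocks; i++) {
        size_t const len = remaining < BLOCK_SIZE ? remaining : BLOCK_SIZE;
        sizes[i] = len;
        remaining -= len;
    }
    return nbBlocks;
}
```

A 2500-byte input, for example, yields three blocks of 1024, 1024, and 452 bytes. Each block would then be compressed independently with the trained dictionary.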
I/O Performance improvements
The 1.5.4 release improves I/O performance of the zstd CLI by using system buffers (macOS) and adding a new asynchronous I/O capability, enabled by default on large files (when threading is available). The user can also explicitly control this capability with the --[no-]asyncio flag. The new I/O threads remove the need to block on I/O operations. The impact is mostly noticeable when decompressing large files (a few MB or more), though the exact outcome depends on environment and run conditions.
Decompression benefits the most, because it is single-threaded and serial and runs at high speed, so any time spent waiting on I/O weighs heavily on the total. In some cases we observe up to a 2x improvement (local Mac machines), and gains ranging from +15% to +45% on Intel Linux servers (see table for details).
On the compression side of things, we’ve measured improvements of up to 5%. The impact is lower because compression is already partially asynchronous via the internal MT mode (see release v1.3.4).
The following table shows decompression speed for silesia and enwik8 on several platforms, including some Skylake-era Linux servers and an M1 MacBook Pro. It compares v1.5.2 against v1.5.4 with asyncio off and on.
| platform | corpus | v1.5.2 | v1.5.4-no-asyncio | v1.5.4 | Improvement |
| --- | --- | --- | --- | --- | --- |
| Xeon D-2191A CentOS8 | enwik8 | 280 MB/s | 280 MB/s | 324 MB/s | +16% |
| Xeon D-2191A CentOS8 | silesia.tar | 303 MB/s | 302 MB/s | 386 MB/s | +27% |
| i7-1165g7 win10 | enwik8 | 270 MB/s | 280 MB/s | 350 MB/s | +27% |
| i7-1165g7 win10 | silesia.tar | 450 MB/s | 440 MB/s | 580 MB/s | +28% |
| i7-9700K Ubuntu20 | enwik8 | 600 MB/s | 604 MB/s | 829 MB/s | +38% |
| i7-9700K Ubuntu20 | silesia.tar | 683 MB/s | 678 MB/s | 991 MB/s | +45% |
| Galaxy S22 | enwik8 | 360 MB/s | 420 MB/s | 515 MB/s | +70% |
| Galaxy S22 | silesia.tar | 310 MB/s | 320 MB/s | 580 MB/s | +85% |
| MBP M1 | enwik8 | 428 MB/s | 734 MB/s | 815 MB/s | +90% |
| MBP M1 | silesia.tar | 465 MB/s | 875 MB/s | 1001 MB/s | +115% |
Support of externally-defined sequence producers
libzstd can now support external sequence producers via a new advanced registration function ZSTD_registerSequenceProducer() (#3333).
This API allows users to provide their own custom sequence producer, which libzstd invokes to process each block. The produced list of sequences (literals and matches) is then post-processed by libzstd to produce valid compressed blocks.
This block-level offload API is a more granular complement to the existing frame-level offload API ZSTD_compressSequences() (introduced in v1.5.1). It offers an easier migration story for applications already integrated with libzstd: the user application continues to invoke the same compression functions ZSTD_compress2() or ZSTD_compressStream2() as usual, and transparently benefits from the specific properties of the external sequence producer. For example, the sequence producer could be tuned to take advantage of known characteristics of the input, to offer a better speed / ratio trade-off.
One scenario that becomes possible is to combine this capability with hardware-accelerated matchfinders, such as the Intel® QuickAssist accelerator (Intel® QAT) provided in server CPUs such as the 4th Gen Intel® Xeon® Scalable processors (previously codenamed Sapphire Rapids). More details to be provided in future communications.
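To make the idea of a "list of sequences (literals and matches)" concrete, here is a deliberately naive, self-contained sketch of what a sequence producer computes for one block. It does not use the libzstd API: the `ToySequence` type and the `toy_*` functions are hypothetical stand-ins, and the brute-force search is only for illustration. The actual callback type and sequence structure expected by ZSTD_registerSequenceProducer() are declared in `zstd.h` and should be taken from there.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sequence record: litLength literal bytes,
 * followed by a match of matchLength bytes copied from `offset` bytes back. */
typedef struct {
    unsigned offset;
    unsigned litLength;
    unsigned matchLength;
} ToySequence;

#define TOY_MIN_MATCH 4  /* shorter matches are left as literals */

/* Length of the common prefix of src[a..] and src[b..], with a < b. */
static size_t toy_match_len(const unsigned char *src, size_t srcSize,
                            size_t a, size_t b)
{
    size_t len = 0;
    while (b + len < srcSize && src[a + len] == src[b + len]) len++;
    return len;
}

/* Naive O(n^2) producer: at each position, scan every earlier position for
 * the longest match. Returns the number of sequences written. The final
 * sequence carries the trailing literals, with offset and matchLength 0. */
static size_t toy_produce_sequences(const unsigned char *src, size_t srcSize,
                                    ToySequence *out, size_t outCapacity)
{
    size_t nbSeqs = 0, pos = 0, litStart = 0;
    while (pos < srcSize) {
        size_t bestLen = 0, bestOff = 0, cand;
        for (cand = 0; cand < pos; cand++) {
            size_t const len = toy_match_len(src, srcSize, cand, pos);
            if (len >= TOY_MIN_MATCH && len >= bestLen) {
                bestLen = len;
                bestOff = pos - cand;
            }
        }
        if (bestLen >= TOY_MIN_MATCH) {
            assert(nbSeqs < outCapacity);
            out[nbSeqs].offset = (unsigned)bestOff;
            out[nbSeqs].litLength = (unsigned)(pos - litStart);
            out[nbSeqs].matchLength = (unsigned)bestLen;
            nbSeqs++;
            pos += bestLen;
            litStart = pos;
        } else {
            pos++;  /* no usable match: this byte becomes a literal */
        }
    }
    assert(nbSeqs < outCapacity);
    out[nbSeqs].offset = 0;
    out[nbSeqs].litLength = (unsigned)(srcSize - litStart);
    out[nbSeqs].matchLength = 0;
    return nbSeqs + 1;
}
```

On the 12-byte input `abcdabcdabcd`, this sketch emits one sequence with 4 literals followed by an 8-byte match at offset 4, then an empty terminating sequence. A real producer, whether software or hardware-accelerated, would replace the brute-force search with its own matchfinder while the rest of the compression pipeline stays in libzstd.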
Change Log
perf: +20% faster huffman decompression for targets that can't compile x64 assembly (#3449, @terrelln)
perf: up to +10% faster streaming compression at levels 1-2 (#3114, @embg)
perf: +4-13% for levels 5-12 by optimizing function generation (#3295, @terrelln)
perf: +3-11% compression speed for arm target (#3199, #3164, #3145, #3141, #3138, @JunHe77 and #3139, #3160, @danlark1)
perf: +5-30% faster dictionary compression at levels 1-4 (#3086, #3114, #3152, @embg)
perf: +10-20% cold dict compression speed by prefetching CDict tables (#3177, @embg)
perf: +1% faster compression by removing a branch in ZSTD_fast_noDict (#3129, @felixhandte)
perf: Small compression ratio improvements in high compression mode (#2983, #3391, @Cyan4973 and #3285, #3302, @daniellerozenblit)
perf: small speed improvement by better detecting STATIC_BMI2 for clang (#3080, @TocarIP)
perf: Improved streaming performance when ZSTD_c_stableInBuffer is set (#2974, @Cyan4973)
cli: Asynchronous I/O for improved cli speed (#2975, #2985, #3021, #3022, @yoniko)
cli: Change zstdless behavior to align with zless (#2909, @binhdvo)
cli: Keep original file if -c or --stdout is given (#3052, @dirkmueller)
cli: Keep original files when result is concatenated into a single output with -o (#3450, @Cyan4973)
cli: Preserve Permissions and Ownership of regular files (#3432, @felixhandte)
cli: Print zlib/lz4/lzma library versions with -vv (#3030, @terrelln)
cli: Print checksum value for single frame files with -lv (#3332, @Cyan4973)
cli: Print dictID when present with -lv (#3184, @htnhan)
cli: when stderr is not the console, disable status updates, but preserve final summary (#3458, @Cyan4973)
cli: support --best and --no-name in gzip compatibility mode (#3059, @dirkmueller)
cli: support for posix high resolution timer clock_gettime(), for improved benchmark accuracy (#3423, @Cyan4973)
cli: improved help/usage (-h, -H) formatting (#3094, @dirkmueller and #3385, @jonpalmisc)
cli: Better handling of bogus numeric values (#3268, @ctkhanhly)
cli: Fix handling of input consisting of multiple files and stdin (#3222, @yoniko)
cli: Fix tiny files passthrough (#3215, @cgbur)
cli: Fix for -r on empty directory (#3027, @brailovich)
cli: Fix empty string as argument for --output-dir-* (#3220, @embg)
cli: Fix decompression memory usage reported by -vv --long (#3042, @u1f35c, and #3232, @zengyijing)
cli: Fix infinite loop when empty input is passed to trainer (#3081, @terrelln)
cli: Fix doesn't work when is also set (#3354, @terrelln)
api: Support for External Sequence Producer (#3333, @embg)
api: Support for in-place decompression (#3432, @terrelln)
api: New function, set all parameters defined in a structure (#3403, @Cyan4973)
api: Streaming decompression detects incorrect header ID sooner (#3175, @Cyan4973)
api: Window size resizing optimization for edge case (#3345, @daniellerozenblit)
api: More accurate error codes for busy-loop scenarios (#3413, #3455, @Cyan4973)
api: Fix limit overflow in and (#3362, #3373, @Cyan4973) reported by @nigeltao
api: Deprecate several advanced experimental functions: streaming (#3408, @embg), copy (#3196, @mileshu)
bug: Fix corruption that rarely occurs in 32-bit mode with wlog=25 (#3361, @terrelln)
bug: Fix for block-splitter (#3033, @Cyan4973)
bug: Fixes for Sequence Compression API (#3023, #3040, @Cyan4973)
bug: Fix leaking thread handles on Windows (#3147, @animalize)
bug: Fix timing issues with cmake/meson builds (#3166, #3167, #3170, @Cyan4973)
build: Allow user to select legacy level for cmake (#3050, @shadchin)
build: Enable legacy support by default in cmake (#3079, @niamster)
build: Meson build script improvements (#3039, #3120, #3122, #3327, #3357, @eli-schwartz and #3276, @neheb)
build: Add aarch64 to supported architectures for zstd_trace (#3054, @ooosssososos)
build: support AIX architecture (#3219, @qiongsiwu)
build: Fix build macro, which now reduces static library size by half (#3366, @terrelln)
build: Fix Windows issues with Multithreading translation layer (#3364, #3380, @yoniko) and ARM64 target (#3320, @cwoffenden)
build: Fix script (#3382, #3392, @terrelln and #3252 @Tachi107 and #3167 @Cyan4973)
doc: Updated man page, providing more details for mode (#3112, @Cyan4973)
doc: Add decompressor errata document (#3092, @terrelln)
misc: Enable Intel CET (#2992, #2994, @hjl-tools)
misc: Fix seekable format (#3058, @yhoogstrate and #3346, @daniellerozenblit)
misc: Improve speed of the one-file library generator (#3241, @wahern and #3005, @cwoffenden)
PR list (generated by GitHub)
- x86-64: Enable Intel CET by @hjl-tools in https://github.com/facebook/zstd/pull/2992
- Add GitHub Action Checking that Zstd Runs Successfully Under CET by @felixhandte in https://github.com/facebook/zstd/pull/3015
- [opt] minor compression ratio improvement by @Cyan4973 in https://github.com/facebook/zstd/pull/2983
- Simplify HUF_decompress4X2_usingDTable_internal_bmi2_asm_loop by @WojciechMula in https://github.com/facebook/zstd/pull/3013
- Async write for decompression by @yoniko in https://github.com/facebook/zstd/pull/2975
- ZSTD CLI: Use buffered output by @yoniko in https://github.com/facebook/zstd/pull/2985
- Use faster Python script to amalgamate by @cwoffenden in https://github.com/facebook/zstd/pull/3005
- Change zstdless behavior to align with zless by @binhdvo in https://github.com/facebook/zstd/pull/2909
- AsyncIO compression part 1 - refactor of existing asyncio code by @yoniko in https://github.com/facebook/zstd/pull/3021
- Converge sumtype (offset | repcode) numeric representation towards offBase by @Cyan4973 in https://github.com/facebook/zstd/pull/2965
- fix sequence compression API in Explicit Delimiter mode by @Cyan4973 in https://github.com/facebook/zstd/pull/3023
- Lazy parameters adaptation (part 1 - ZSTD_c_stableInBuffer) by @Cyan4973 in https://github.com/facebook/zstd/pull/2974
- Print zlib/lz4/lzma library versions in verbose version output by @terrelln in https://github.com/facebook/zstd/pull/3030
- fix for -r on empty directory by @brailovich in https://github.com/facebook/zstd/pull/3027
- Add new CLI testing platform by @terrelln in https://github.com/facebook/zstd/pull/3020
- AsyncIO compression part 2 - added async read and asyncio to compression code by @yoniko in https://github.com/facebook/zstd/pull/3022
- Macos playtest envvars fix by @yoniko in https://github.com/facebook/zstd/pull/3035
- Fix required decompression memory usage reported by -vv + --long by @u1f35c in https://github.com/facebook/zstd/pull/3042
- Select legacy level for cmake by @shadchin in https://github.com/facebook/zstd/pull/3050
- [trace] Add aarch64 to supported architectures for zstd_trace by @ooosssososos in https://github.com/facebook/zstd/pull/3054
- New features for largeNbDicts benchmark by @embg in https://github.com/facebook/zstd/pull/3063
- Use helper function for bit manipulations. by @TocarIP in https://github.com/facebook/zstd/pull/3075
- [programs] Fix infinite loop when empty input is passed to trainer by @terrelln in https://github.com/facebook/zstd/pull/3081
- Enable STATIC_BMI2 for gcc/clang by @TocarIP in https://github.com/facebook/zstd/pull/3080
- build:cmake: enable ZSTD legacy support by default by @niamster in https://github.com/facebook/zstd/pull/3079
New Contributors
- @WojciechMula made their first contribution in https://github.com/facebook/zstd/pull/3013
- @trixirt made their first contribution in https://github.com/facebook/zstd/pull/3026
- @brailovich made their first contribution in https://github.com/facebook/zstd/pull/3027
- @u1f35c made their first contribution in https://github.com/facebook/zstd/pull/3042
- @shadchin made their first contribution in https://github.com/facebook/zstd/pull/3050
- @ooosssososos made their first contribution in https://github.com/facebook/zstd/pull/3054
- @TocarIP made their first contribution in https://github.com/facebook/zstd/pull/3075
- @xry111 made their first contribution in https://github.com/facebook/zstd/pull/3084
- @niamster made their first contribution in https://github.com/facebook/zstd/pull/3079
- @dirkmueller made their first contribution in https://github.com/facebook/zstd/pull/3059
- @cyberknight777 made their first contribution in https://github.com/facebook/zstd/pull/3088
- @dpelle made their first contribution in https://github.com/facebook/zstd/pull/3095
- @paulmenzel made their first contribution in https://github.com/facebook/zstd/pull/3108
- @cuishuang made their first contribution in https://github.com/facebook/zstd/pull/3117
- @averred made their first contribution in https://github.com/facebook/zstd/pull/3135
- @JunHe77 made their first contribution in https://github.com/facebook/zstd/pull/3145
- @htnhan made their first contribution in https://github.com/facebook/zstd/pull/3184
- @udayanbapat made their first contribution in https://github.com/facebook/zstd/pull/3118
- @zhuhan0 made their first contribution in https://github.com/facebook/zstd/pull/3205
- @mgord9518 made their first contribution in https://github.com/facebook/zstd/pull/3218
- @qiongsiwu made their first contribution in https://github.com/facebook/zstd/pull/3219
- @orbea made their first contribution in https://github.com/facebook/zstd/pull/3217
- @cgbur made their first contribution in https://github.com/facebook/zstd/pull/3215
- @tomcwang made their first contribution in https://github.com/facebook/zstd/pull/3208
- @mileshu made their first contribution in https://github.com/facebook/zstd/pull/3196
- @zengyijing made their first contribution in https://github.com/facebook/zstd/pull/3226
- @grossws made their first contribution in https://github.com/facebook/zstd/pull/3230
- @wahern made their first contribution in https://github.com/facebook/zstd/pull/3241
- @daniellerozenblit made their first contribution in https://github.com/facebook/zstd/pull/3258
- @DimitriPapadopoulos made their first contribution in https://github.com/facebook/zstd/pull/3259
Full Automated Changelog: https://github.com/facebook/zstd/compare/v1.5.2...v1.5.4