zstd: Misc optimizations and refactoring in DecompressSequences_LdsFseCache. #73

Open

Jonathan-Weinstein-AMD wants to merge 9 commits into microsoft:development from Jonathan-Weinstein-AMD:decompress-sequences-lds-fse-cleanup

Conversation

Jonathan-Weinstein-AMD commented Feb 19, 2026

Main changes in this PR:

  1. Add an `if (threadId != 0) return;` after `ZSTDGPU_PRELOAD_FSE_INTO_LDS` and the `GroupMemoryBarrierWithGroupSync` are done, since the rest of the shader is logically scalar (what the current compiler actually does with it is a different story). This greatly helps RDNA3 and RDNA4, which often end up picking wave64 for compute shaders: turning off the high 32 bits of exec lets the typical second pass of a wave64 VALU/VMEM/LDS instruction be skipped. An input zst file of ~146MB ("insects") does a Dispatch(15299, 1, 1) whose duration, on a 7900 XTX (RDNA3) with SetStablePowerState, changes from 1.81 ms to 1.50 ms. Using DecompressSequences_LdsFseCache32 seems to reduce that further (~1.44 ms), but with the other changes in this PR DecompressSequences_LdsFseCache{32,64} are pretty close on Navi31, and a TG size of 64 could be better for other data sets, so the PSO variant selection is kept the same for now. (See the sketch after this list.)
  2. Restructure the loop so its exit condition sits in the middle and ZSTDGPU_DECODE_SEQ is invoked just once. This is mainly for easier performance analysis; runtime performance does not seem affected. (Also shown in the sketch below.)
  3. Use an alternate firstbithigh for known-nonzero input to work around the current AMD GPU compiler not emitting scalar instructions. This tidies up the ISA a bit (the bitbuffer InitWithSegment part), but the loop is still an issue.
  4. Merge/pack the SEQ_{LITERAL, MATCH}_LENGTH_BASELINES and SEQ_{LITERAL, MATCH}_LENGTH_EXTRA_BITS arrays into one array. Each lookup previously needed a separate s_load_b32 with its own waitcnt, and packing also reduces the duplicate indexing and OOB-handling code the compiler emits. Navi31 XTX SetStablePowerState duration for the ~146MB "insects" archive: 1.44 ms -> 1.35 ms. See the commit in the PR for the before/after ISA.
  5. Convert all backwards bitbuffers to raw buffers, for consistency with HuffmanStream and because ByteAddressBuffer is friendlier to SMEM (SMEM doesn't do the stride-mul itself), though that last point may only become relevant after future compiler changes.
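A minimal sketch of what (1) and (2) change structurally. The helper names (`PreloadFseTablesIntoLds`, `DecodeOneSequence`, `RefillAndAdvanceFseStates`) and `numSequences` are placeholders standing in for the real ZSTDGPU_* macros and state, not the shader's actual code:

```hlsl
[numthreads(64, 1, 1)] // the LdsFseCache32 variant uses 32
void DecompressSequences_LdsFseCache(uint threadId : SV_GroupThreadID)
{
    // All lanes cooperate to fill the LDS FSE tables, then synchronize.
    PreloadFseTablesIntoLds(threadId);
    GroupMemoryBarrierWithGroupSync();

    // (1) From here on the work is logically scalar, so retire every lane
    // but lane 0. On RDNA3/RDNA4 wave64 this zeroes the high 32 bits of
    // exec, so the second pass of each VALU/VMEM/LDS issue is skipped.
    if (threadId != 0)
        return;

    // (2) Exit test in the middle of the loop, so the decode body is
    // emitted once instead of duplicated before and inside the loop.
    // Assumes at least one sequence to decode.
    uint remaining = numSequences;
    for (;;)
    {
        DecodeOneSequence(); // stands in for ZSTDGPU_DECODE_SEQ
        if (--remaining == 0)
            break;
        RefillAndAdvanceFseStates();
    }
}
```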

Diff is best viewed when ignoring whitespace changes.

This PR should probably be squashed if/when merged; I didn't force-push anything that would change commit hashes.

An input zst file of ~146MB does a Dispatch(15299, 1, 1). On a 7900 XTX (RDNA3) with SetStablePowerState, the duration changes from 1.811 ms to 1.441 ms.

There could perhaps be more perf testing for this, but a nearby #ifdef suggests a threadgroup size of 32 is also better on RDNA2 Xbox.

Current IHV-compiler output seems lacking in some areas; if that improves, the preferred threadgroup size may change.
…bail after LDS stores. Also remove the condition for the GroupMemoryBarrierWithGroupSync. The IHV compiler should not emit a barrier if it isn't needed, and should be capable of removing empty blocks. s_barrier also behaves like an s_nop for single-wave threadgroups on AMD.

This does not apply to single-wave threadgroups using DeviceMemoryBarrierWithGroupSync.
…ECODE_SEQ twice. The benefit is less compiled code, for easier performance analysis. Performance seems about the same. The ZSTDGPU_DECODE_SEQ macro could be removed, but it's kept for potential future experimentation (and to reduce the diff when ignoring whitespace).
…d/ran this at one point before, and it was faster for the loop to have only one termination condition. Added a comment near the #if 0 about raw buffers.
@Jonathan-Weinstein-AMD Jonathan-Weinstein-AMD marked this pull request as draft February 19, 2026 20:08
… current AMD GPU compiler not emitting scalar instructions. This tidies up the ISA for DecompressSequences_LdsFseCache a bit (the bitbuffer InitWithSegment part), but the loop is still an issue.
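The commit doesn't show the exact replacement here. As one hedged example of the kind of alternative that can avoid the problematic intrinsic, the float-exponent trick below computes firstbithigh for a value known to be nonzero; the function name and the below-2^24 restriction are this sketch's assumptions, not the PR's code:

```hlsl
// Hypothetical stand-in, not necessarily the PR's trick: for x known to
// be nonzero and below (1 << 24) (so the float conversion is exact),
// floor(log2(x)) can be read straight out of the float exponent field.
uint FirstBitHighNonZero(uint x)
{
    return (asuint((float)x) >> 23) - 127;
}
```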
…AL, MATCH}_LENGTH_EXTRA_BITS arrays into one array. There was a separate s_load_b32 with its own waitcnt for each, and it also reduces the duplicate indexing and OOB-handling code the compiler emits. Navi31 XTX SetStablePowerState duration for the ~146MB "insects" archive: 1.44 ms -> 1.35 ms.

Before (some instructions are omitted):

```
s_getpc_b64   s[24:25]
s_add_u32     s24, s24, lit(0x00000514)
s_addc_u32    s25, s25, 0
---------------------------------------
s_lshl_b32    s15, s15, 2
s_add_i32     s26, s15, lit(0x00000200)
s_min_u32     s26, s26, lit(0x000002d4)
s_load_b32    s26, s[24:25], s26
s_waitcnt     lgkmcnt(0)
---------------------------------------
s_lshl_b32    s27, s27, 2
s_addk_i32    s27, 0x0090
s_min_u32     s27, s27, lit(0x000002d4)
s_load_b32    s27, s[24:25], s27
s_waitcnt     lgkmcnt(0)
---------------------------------------
s_addk_i32    s15, 0x0120
s_lshl_b32    s6, s6, 2
s_min_u32     s15, s15, lit(0x000002d4)
s_min_u32     s6, s6, lit(0x000002d4)
s_load_b32    s15, s[24:25], s15
s_load_b32    s6, s[24:25], s6
s_waitcnt     lgkmcnt(0)
```

After (some instructions are omitted):

```
s_getpc_b64   s[24:25]
s_add_u32     s24, s24, lit(0x000004fc)
s_addc_u32    s25, s25, 0
---------------------------------------
s_lshl_b32    s15, s15, 2
s_addk_i32    s15, 0x0090
s_min_u32     s15, s15, lit(0x00000164)
s_load_b32    s15, s[24:25], s15
s_waitcnt     lgkmcnt(0)
v_and_b32     v12, s15, 31 // ignore VALU now
v_lshrrev_b32  v14, 5, s15 // ignore VALU now
---------------------------------------
s_lshl_b32    s6, s6, 2
s_min_u32     s6, s6, lit(0x00000164)
s_load_b32    s6, s[24:25], s6
s_waitcnt     lgkmcnt(0)
v_and_b32     v7, s6, 31 // ignore VALU now
v_lshrrev_b32  v7, 5, s6 // ignore VALU now
```
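From the `& 31` / `>> 5` pattern in the after-ISA, the packing plausibly looks like the sketch below; the helper name and entry layout are illustrative guesses, not the PR's actual code:

```hlsl
// Illustrative layout: low 5 bits hold the extra-bit count and the
// remaining bits the baseline, i.e. entries are (baseline << 5) | extraBits.
// One merged table means one s_load_b32 (with one waitcnt) and one bounds
// clamp per lookup instead of two of each.
void UnpackLengthEntry(uint entry, out uint baseline, out uint extraBits)
{
    extraBits = entry & 31; // matches v_and_b32    vN, sN, 31 above
    baseline  = entry >> 5; // matches v_lshrrev_b32 vN, 5, sN above
}
```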
…do a built-in stride-mul, so StructuredBuffer isn't as friendly to SMEM loads. The current AMD driver compiler is still generating buffer_load (VMEM, no s_ prefix) in the main DecompressSequences_LdsFseCache loop, but hopefully that'll be fixed in the future (if this pass is still structured similarly).

Deleted ZSTDGPU_USE_REVERSED_BIT_BUFFER_OFFSET so there are fewer paths. I don't see a benefit in it as it stands; it looks like the same amount of ALU, but it needs an extra field.
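A minimal sketch of the raw-buffer point, assuming illustrative resource names (not the PR's): a StructuredBuffer index hides a stride multiply that SMEM can't perform itself, while ByteAddressBuffer takes the byte offset directly, keeping the address math explicit in the shader:

```hlsl
StructuredBuffer<uint> inWordsStructured; // illustrative
ByteAddressBuffer      inCompressedRaw;   // illustrative

uint LoadWord(uint wordIndex)
{
    // Structured: the implicit *4 stride lives in the access, which SMEM
    // cannot fold in itself.
    // uint v = inWordsStructured[wordIndex];

    // Raw: the byte offset is computed explicitly by the shader; no
    // hidden stride multiply in the load.
    return inCompressedRaw.Load(wordIndex * 4);
}
```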

I did this to ensure (more) testing for zstdgpu_Backward_BitBuffer (no _V0):
```diff
diff --git a/zstd/zstdgpu/zstdgpu_shaders.h b/zstd/zstdgpu/zstdgpu_shaders.h
index f3f61cf..86f7885 100644
--- a/zstd/zstdgpu/zstdgpu_shaders.h
+++ b/zstd/zstdgpu/zstdgpu_shaders.h
@@ -3575,8 +3575,8 @@ static void zstdgpu_ShaderEntry_DecompressSequences_LdsFseCache(ZSTDGPU_PARAM_IN
         #   error `ZSTDGPU_BACKWARD_BITBUF` must not be defined.
         #endif

-        zstdgpu_Backward_BitBuffer_V0 bitBuffer;
-        #define ZSTDGPU_BACKWARD_BITBUF(method) zstdgpu_Backward_BitBuffer_V0_##method
+        zstdgpu_Backward_BitBuffer bitBuffer;
+        #define ZSTDGPU_BACKWARD_BITBUF(method) zstdgpu_Backward_BitBuffer_##method
         ZSTDGPU_BACKWARD_BITBUF(InitWithSegment)(bitBuffer, srt.inCompressedData, seqRef.src);

         #define ZSTDGPU_INIT_FSE_STATE(name)                                                                    \
@@ -3671,7 +3671,7 @@ static void zstdgpu_ShaderEntry_DecompressSequences_LdsFseCache(ZSTDGPU_PARAM_IN
             stateMLen = nstateMLen + restMLen;
             stateOffs = nstateOffs + restOffs;
         }
-        ZSTDGPU_ASSERT(bitBuffer.hadlastrefill && bitBuffer.bitcnt == 0);
+        // ZSTDGPU_ASSERT(bitBuffer.hadlastrefill && bitBuffer.bitcnt == 0);
         #undef ZSTDGPU_BACKWARD_BITBUF
     }
```
…NA."

This reverts commit e1fe0f9.

With the `if (threadId != 0) { return; }` change, the performance difference between a TG size of 32 (always wave32) and a TG size of 64 (probably wave64 on Navi3, and more likely on Navi4) is small. In the ~146MB insects archive on Navi31, a TG size of 32 is maybe still slightly faster (by 5-10 us with SetStablePowerState), but for other data sets, which may have bigger FSE tables and fewer sequences, a TG size of 64 could fare better.
@Jonathan-Weinstein-AMD Jonathan-Weinstein-AMD changed the title zstd/DecompressSequences_LdsFseCache: Prefer TG size of 32 instead of 64 and some refactoring. zstd: Misc optimizations and refactoring in DecompressSequences_LdsFseCache. Feb 20, 2026
@Jonathan-Weinstein-AMD Jonathan-Weinstein-AMD marked this pull request as ready for review February 20, 2026 20:14