zstd: Misc optimizations and refactoring in DecompressSequences_LdsFseCache. #73

Open

Jonathan-Weinstein-AMD wants to merge 9 commits into microsoft:development from Jonathan-Weinstein-AMD:decompress-sequences-lds-fse-cleanup

Conversation

Jonathan-Weinstein-AMD commented Feb 19, 2026

Main changes in this PR:

  1. Add an `if (threadId != 0) return;` after `ZSTDGPU_PRELOAD_FSE_INTO_LDS` and the `GroupMemoryBarrierWithGroupSync` are done, since the rest of the shader is logically scalar (what the current compiler actually does with it is a different story). This greatly helps RDNA3 and RDNA4, which often end up picking wave64 for compute shaders: turning off the high 32 bits of exec lets the typical second pass of a wave64 VALU/VMEM/LDS instruction be skipped. An input zst file of ~146MB ("insects") does a Dispatch(15299, 1, 1) whose duration, on a 7900 XTX (RDNA3) with SetStablePowerState, changes from 1.81 ms to 1.50 ms. Using DecompressSequences_LdsFseCache32 seems to reduce that further (~1.44 ms), but with the other changes in this PR DecompressSequences_LdsFseCache{32,64} are pretty close on Navi31, and a TG size of 64 could be better for other data sets, so the PSO variant selection is kept the same for now. (See the sketch after this list.)
  2. Restructure the loop so its exit condition sits in the middle and ZSTDGPU_DECODE_SEQ is invoked just once. This is mainly for easier performance analysis; runtime performance does not seem affected. (Also shown in the sketch below.)
  3. Use an alternate firstbithigh for known-nonzero input to work around the current AMD GPU compiler not emitting scalar instructions. This tidies up the ISA a bit (the bitbuffer InitWithSegment part), but the loop is still an issue.
  4. Merge/pack the SEQ_{LITERAL, MATCH}_LENGTH_BASELINES and SEQ_{LITERAL, MATCH}_LENGTH_EXTRA_BITS arrays into one array. Each lookup previously needed a separate s_load_b32 with its own waitcnt, and packing also reduces the duplicate indexing and OOB-handling code the compiler emits. Navi31 XTX SetStablePowerState duration for the ~146MB "insects" archive: 1.44 ms -> 1.35 ms. See the commit in the PR for the before/after ISA.
  5. Convert all backwards bitbuffers to raw buffers, for consistency with HuffmanStream and because ByteAddressBuffer is friendlier to SMEM (SMEM doesn't do the stride-mul itself), though that last point may only become relevant after future compiler changes.
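A minimal sketch of what (1) and (2) change structurally. The helper names (`PreloadFseTablesIntoLds`, `DecodeOneSequence`, `RefillAndAdvanceFseStates`) and `numSequences` are placeholders standing in for the real ZSTDGPU_* macros and state, not the shader's actual code:

```hlsl
[numthreads(64, 1, 1)] // the LdsFseCache32 variant uses 32
void DecompressSequences_LdsFseCache(uint threadId : SV_GroupThreadID)
{
    // All lanes cooperate to fill the LDS FSE tables, then synchronize.
    PreloadFseTablesIntoLds(threadId);
    GroupMemoryBarrierWithGroupSync();

    // (1) From here on the work is logically scalar, so retire every lane
    // but lane 0. On RDNA3/RDNA4 wave64 this zeroes the high 32 bits of
    // exec, so the second pass of each VALU/VMEM/LDS issue is skipped.
    if (threadId != 0)
        return;

    // (2) Exit test in the middle of the loop, so the decode body is
    // emitted once instead of duplicated before and inside the loop.
    // Assumes at least one sequence to decode.
    uint remaining = numSequences;
    for (;;)
    {
        DecodeOneSequence(); // stands in for ZSTDGPU_DECODE_SEQ
        if (--remaining == 0)
            break;
        RefillAndAdvanceFseStates();
    }
}
```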

Diff is best viewed when ignoring whitespace changes.

This PR should probably be squashed if/when merged; I didn't force-push anything that would change commit hashes.

An input zst file of ~146MB does a Dispatch(15299, 1, 1). On a 7900 XTX (RDNA3) with SetStablePowerState, the duration changes from 1.811 ms to 1.441 ms.

There could perhaps be more perf testing for this, but a nearby #ifdef suggests a threadgroup size of 32 is also better on RDNA2 Xbox.

Current IHV-compiler output seems lacking in some areas; if that improves, the preferred threadgroup size may change.
…bail after LDS stores. Also remove the condition for the GroupMemoryBarrierWithGroupSync. The IHV compiler should not emit a barrier if it isn't needed, and should be capable of removing empty blocks. s_barrier also behaves like an s_nop for single-wave threadgroups on AMD.

This does not apply to single-wave threadgroups using DeviceMemoryBarrierWithGroupSync.
…ECODE_SEQ twice. The benefit is less compiled code, for easier performance analysis. Performance seems about the same. The ZSTDGPU_DECODE_SEQ macro could be removed, but it's kept for potential future experimentation (and to reduce the diff when ignoring whitespace).
…d/ran this at one point before, and it was faster for the loop to have only one termination condition. Added a comment near the #if 0 about raw buffers.
@Jonathan-Weinstein-AMD Jonathan-Weinstein-AMD marked this pull request as draft February 19, 2026 20:08
… current AMD GPU compiler not emitting scalar instructions. This tidies up the ISA for DecompressSequences_LdsFseCache a bit (the bitbuffer InitWithSegment part), but the loop is still an issue.
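The commit doesn't show the exact replacement here. As one hedged example of the kind of alternative that can avoid the problematic intrinsic, the float-exponent trick below computes firstbithigh for a value known to be nonzero; the function name and the below-2^24 restriction are this sketch's assumptions, not the PR's code:

```hlsl
// Hypothetical stand-in, not necessarily the PR's trick: for x known to
// be nonzero and below (1 << 24) (so the float conversion is exact),
// floor(log2(x)) can be read straight out of the float exponent field.
uint FirstBitHighNonZero(uint x)
{
    return (asuint((float)x) >> 23) - 127;
}
```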
…AL, MATCH}_LENGTH_EXTRA_BITS arrays into one array. There was a separate s_load_b32 with its own waitcnt for each, and it also reduces the duplicate indexing and OOB-handling code the compiler emits. Navi31 XTX SetStablePowerState duration for the ~146MB "insects" archive: 1.44 ms -> 1.35 ms.

Before (some instructions are omitted):

```
s_getpc_b64   s[24:25]
s_add_u32     s24, s24, lit(0x00000514)
s_addc_u32    s25, s25, 0
---------------------------------------
s_lshl_b32    s15, s15, 2
s_add_i32     s26, s15, lit(0x00000200)
s_min_u32     s26, s26, lit(0x000002d4)
s_load_b32    s26, s[24:25], s26
s_waitcnt     lgkmcnt(0)
---------------------------------------
s_lshl_b32    s27, s27, 2
s_addk_i32    s27, 0x0090
s_min_u32     s27, s27, lit(0x000002d4)
s_load_b32    s27, s[24:25], s27
s_waitcnt     lgkmcnt(0)
---------------------------------------
s_addk_i32    s15, 0x0120
s_lshl_b32    s6, s6, 2
s_min_u32     s15, s15, lit(0x000002d4)
s_min_u32     s6, s6, lit(0x000002d4)
s_load_b32    s15, s[24:25], s15
s_load_b32    s6, s[24:25], s6
s_waitcnt     lgkmcnt(0)
```

After (some instructions are omitted):

```
s_getpc_b64   s[24:25]
s_add_u32     s24, s24, lit(0x000004fc)
s_addc_u32    s25, s25, 0
---------------------------------------
s_lshl_b32    s15, s15, 2
s_addk_i32    s15, 0x0090
s_min_u32     s15, s15, lit(0x00000164)
s_load_b32    s15, s[24:25], s15
s_waitcnt     lgkmcnt(0)
v_and_b32     v12, s15, 31 // ignore VALU now
v_lshrrev_b32  v14, 5, s15 // ignore VALU now
---------------------------------------
s_lshl_b32    s6, s6, 2
s_min_u32     s6, s6, lit(0x00000164)
s_load_b32    s6, s[24:25], s6
s_waitcnt     lgkmcnt(0)
v_and_b32     v7, s6, 31 // ignore VALU now
v_lshrrev_b32  v7, 5, s6 // ignore VALU now
```
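From the `& 31` / `>> 5` pattern in the after-ISA, the packing plausibly looks like the sketch below; the helper name and entry layout are illustrative guesses, not the PR's actual code:

```hlsl
// Illustrative layout: low 5 bits hold the extra-bit count and the
// remaining bits the baseline, i.e. entries are (baseline << 5) | extraBits.
// One merged table means one s_load_b32 (with one waitcnt) and one bounds
// clamp per lookup instead of two of each.
void UnpackLengthEntry(uint entry, out uint baseline, out uint extraBits)
{
    extraBits = entry & 31; // matches v_and_b32    vN, sN, 31 above
    baseline  = entry >> 5; // matches v_lshrrev_b32 vN, 5, sN above
}
```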
…do a built-in stride-mul, so StructuredBuffer isn't as friendly to SMEM loads. The current AMD driver compiler is still generating buffer_load (VMEM, no s_ prefix) in the main DecompressSequences_LdsFseCache loop, but hopefully that'll be fixed in the future (if this pass is still structured similarly).

Deleted ZSTDGPU_USE_REVERSED_BIT_BUFFER_OFFSET so there are fewer paths. I don't see a benefit in it as it stands; it looks like the same amount of ALU, but it needs an extra field.
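A minimal sketch of the raw-buffer point, assuming illustrative resource names (not the PR's): a StructuredBuffer index hides a stride multiply that SMEM can't perform itself, while ByteAddressBuffer takes the byte offset directly, keeping the address math explicit in the shader:

```hlsl
StructuredBuffer<uint> inWordsStructured; // illustrative
ByteAddressBuffer      inCompressedRaw;   // illustrative

uint LoadWord(uint wordIndex)
{
    // Structured: the implicit *4 stride lives in the access, which SMEM
    // cannot fold in itself.
    // uint v = inWordsStructured[wordIndex];

    // Raw: the byte offset is computed explicitly by the shader; no
    // hidden stride multiply in the load.
    return inCompressedRaw.Load(wordIndex * 4);
}
```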

I did this to ensure (more) testing for zstdgpu_Backward_BitBuffer (no _V0):
```diff
diff --git a/zstd/zstdgpu/zstdgpu_shaders.h b/zstd/zstdgpu/zstdgpu_shaders.h
index f3f61cf..86f7885 100644
--- a/zstd/zstdgpu/zstdgpu_shaders.h
+++ b/zstd/zstdgpu/zstdgpu_shaders.h
@@ -3575,8 +3575,8 @@ static void zstdgpu_ShaderEntry_DecompressSequences_LdsFseCache(ZSTDGPU_PARAM_IN
         #   error `ZSTDGPU_BACKWARD_BITBUF` must not be defined.
         #endif

-        zstdgpu_Backward_BitBuffer_V0 bitBuffer;
-        #define ZSTDGPU_BACKWARD_BITBUF(method) zstdgpu_Backward_BitBuffer_V0_##method
+        zstdgpu_Backward_BitBuffer bitBuffer;
+        #define ZSTDGPU_BACKWARD_BITBUF(method) zstdgpu_Backward_BitBuffer_##method
         ZSTDGPU_BACKWARD_BITBUF(InitWithSegment)(bitBuffer, srt.inCompressedData, seqRef.src);

         #define ZSTDGPU_INIT_FSE_STATE(name)                                                                    \
@@ -3671,7 +3671,7 @@ static void zstdgpu_ShaderEntry_DecompressSequences_LdsFseCache(ZSTDGPU_PARAM_IN
             stateMLen = nstateMLen + restMLen;
             stateOffs = nstateOffs + restOffs;
         }
-        ZSTDGPU_ASSERT(bitBuffer.hadlastrefill && bitBuffer.bitcnt == 0);
+        // ZSTDGPU_ASSERT(bitBuffer.hadlastrefill && bitBuffer.bitcnt == 0);
         #undef ZSTDGPU_BACKWARD_BITBUF
     }
```
…NA."

This reverts commit e1fe0f9.

With the `if (threadId != 0) { return; }` change, the performance difference between a TG size of 32 (always wave32) and a TG size of 64 (probably wave64 on Navi3, and more likely on Navi4) is small. In the ~146MB insects archive on Navi31, a TG size of 32 is maybe still slightly faster (by 5-10 us with SetStablePowerState), but for other data sets, which may have bigger FSE tables and fewer sequences, a TG size of 64 could fare better.
@Jonathan-Weinstein-AMD Jonathan-Weinstein-AMD changed the title zstd/DecompressSequences_LdsFseCache: Prefer TG size of 32 instead of 64 and some refactoring. zstd: Misc optimizations and refactoring in DecompressSequences_LdsFseCache. Feb 20, 2026
@Jonathan-Weinstein-AMD Jonathan-Weinstein-AMD marked this pull request as ready for review February 20, 2026 20:14