zstd: Misc optimizations and refactoring in DecompressSequences_LdsFseCache #73
Open
Jonathan-Weinstein-AMD wants to merge 9 commits into microsoft:development
Conversation
An input zst file of ~146MB does a Dispatch(15299, 1, 1). On a 7900 XTX (RDNA3) with SetStablePowerState, the duration changes from 1.811 ms to 1.441 ms. There could perhaps be more perf testing for this, but a nearby #ifdef suggests a threadgroup size of 32 is also better on RDNA2 Xbox. Current IHV-compiler output seems lacking in some areas; if that is improved, the preferred threadgroup size may change.
…bail after LDS stores. Also remove the condition around the GroupMemoryBarrierWithGroupSync: the IHV compiler should not emit a barrier if it isn't needed, and should be capable of removing empty blocks. s_barrier also behaves like an s_nop for single-wave threadgroups on AMD; this s_nop behavior does not apply to DeviceMemoryBarrierWithGroupSync.
…ECODE_SEQ twice. The benefit is less compiled code, for easier performance analysis. Performance seems about the same. The ZSTGPU_DECODE_SEQ macro could be removed, but it's kept for any potential future experimentation (and to reduce the diff when ignoring whitespace).
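To illustrate the shape of this refactor, here is a hypothetical C sketch; `DECODE_ONE` and `decode_all` are stand-ins, not the real ZSTGPU_DECODE_SEQ or shader code:

```c
#include <assert.h>

/* Hypothetical sketch: instead of expanding a large decode macro twice
 * per loop iteration (duplicating compiled code), expand it once and let
 * the loop iterate once per element.  The result is identical; there is
 * roughly half as much expanded code to analyze. */
#define DECODE_ONE(state, out, i) \
    do { (out)[i] = (state); (state) = (state) * 2654435761u + 1u; } while (0)

static unsigned decode_all(unsigned state, unsigned *out, int n)
{
    /* before: for (i = 0; i < n; i += 2) { DECODE_ONE(...i); DECODE_ONE(...i+1); } */
    for (int i = 0; i < n; i++) {
        DECODE_ONE(state, out, i); /* single macro expansion */
    }
    return state;
}
```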
…d/ran this at one point before, and it was faster for the loop to have only one termination condition. Added a comment near the #if 0 about raw buffers.
… current AMD GPU compiler from not emitting scalar instructions. This tidies up the ISA for DecompressSequences_LdsFseCache a bit (the bitbuffer InitWithSegment part), but the loop is still an issue.
…AL, MATCH}_LENGTH_EXTRA_BITS arrays into one array. There was a separate s_load_b32 with its own waitcnt, and it also reduces duplicate indexing and OOB-handling code the compiler emits. Navi31 XTX SetStablePowerState duration for the ~146MB "insects" archive: 1.44 ms -> 1.35 ms.

Before (some instructions are omitted):

```
s_getpc_b64 s[24:25]
s_add_u32 s24, s24, lit(0x00000514)
s_addc_u32 s25, s25, 0
---------------------------------------
s_lshl_b32 s15, s15, 2
s_add_i32 s26, s15, lit(0x00000200)
s_min_u32 s26, s26, lit(0x000002d4)
s_load_b32 s26, s[24:25], s26
s_waitcnt lgkmcnt(0)
---------------------------------------
s_lshl_b32 s27, s27, 2
s_addk_i32 s27, 0x0090
s_min_u32 s27, s27, lit(0x000002d4)
s_load_b32 s27, s[24:25], s27
s_waitcnt lgkmcnt(0)
---------------------------------------
s_addk_i32 s15, 0x0120
s_lshl_b32 s6, s6, 2
s_min_u32 s15, s15, lit(0x000002d4)
s_min_u32 s6, s6, lit(0x000002d4)
s_load_b32 s15, s[24:25], s15
s_load_b32 s6, s[24:25], s6
s_waitcnt lgkmcnt(0)
```

After (some instructions are omitted):

```
s_getpc_b64 s[24:25]
s_add_u32 s24, s24, lit(0x000004fc)
s_addc_u32 s25, s25, 0
---------------------------------------
s_lshl_b32 s15, s15, 2
s_addk_i32 s15, 0x0090
s_min_u32 s15, s15, lit(0x00000164)
s_load_b32 s15, s[24:25], s15
s_waitcnt lgkmcnt(0)
v_and_b32 v12, s15, 31 // ignore VALU now
v_lshrrev_b32 v14, 5, s15 // ignore VALU now
---------------------------------------
s_lshl_b32 s6, s6, 2
s_min_u32 s6, s6, lit(0x00000164)
s_load_b32 s6, s[24:25], s6
s_waitcnt lgkmcnt(0)
v_and_b32 v7, s6, 31 // ignore VALU now
v_lshrrev_b32 v7, 5, s6 // ignore VALU now
```
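The after-ISA extracts two fields from each loaded word (`v_and_b32 ..., 31` and `v_lshrrev_b32 ..., 5`), which suggests the extra-bits count lives in the low 5 bits and the baseline in the upper bits of each merged entry. A minimal C sketch of that kind of packing; the field layout here is an assumption inferred from the ISA, not necessarily the actual table format:

```c
#include <assert.h>
#include <stdint.h>

/* Pack a (baseline, extraBits) pair into one 32-bit table entry so a
 * single s_load_b32 can fetch both.  Assumed layout (from the ISA):
 * low 5 bits = extra-bit count, upper 27 bits = baseline. */
static inline uint32_t pack_entry(uint32_t baseline, uint32_t extraBits)
{
    assert(extraBits < 32); /* must fit in 5 bits */
    return (baseline << 5) | extraBits;
}

static inline uint32_t entry_extra_bits(uint32_t e) { return e & 31; } /* v_and_b32 ..., 31 */
static inline uint32_t entry_baseline(uint32_t e)   { return e >> 5; } /* v_lshrrev_b32 5, ... */
```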
…do a built-in stride-mul, so StructuredBuffer isn't as friendly to SMEM loads. The current AMD driver compiler still generates buffer_load (VMEM, no s_ prefix) in the main DecompressSequences_LdsFseCache loop, but hopefully that will be fixed in the future (if this pass is still structured similarly).

Deleted ZSTDGPU_USE_REVERSED_BIT_BUFFER_OFFSET so there are fewer paths. I don't see a benefit in it as it is now: it looks like the same amount of ALU, but it needs an extra field. I did this to ensure (more) testing for zstdgpu_Backward_BitBuffer (no _V0):

```diff
diff --git a/zstd/zstdgpu/zstdgpu_shaders.h b/zstd/zstdgpu/zstdgpu_shaders.h
index f3f61cf..86f7885 100644
--- a/zstd/zstdgpu/zstdgpu_shaders.h
+++ b/zstd/zstdgpu/zstdgpu_shaders.h
@@ -3575,8 +3575,8 @@ static void zstdgpu_ShaderEntry_DecompressSequences_LdsFseCache(ZSTDGPU_PARAM_IN
 # error `ZSTDGPU_BACKWARD_BITBUF` must not be defined.
 #endif
 
-    zstdgpu_Backward_BitBuffer_V0 bitBuffer;
-    #define ZSTDGPU_BACKWARD_BITBUF(method) zstdgpu_Backward_BitBuffer_V0_##method
+    zstdgpu_Backward_BitBuffer bitBuffer;
+    #define ZSTDGPU_BACKWARD_BITBUF(method) zstdgpu_Backward_BitBuffer_##method
     ZSTDGPU_BACKWARD_BITBUF(InitWithSegment)(bitBuffer, srt.inCompressedData, seqRef.src);
 
 #define ZSTDGPU_INIT_FSE_STATE(name) \
@@ -3671,7 +3671,7 @@ static void zstdgpu_ShaderEntry_DecompressSequences_LdsFseCache(ZSTDGPU_PARAM_IN
         stateMLen = nstateMLen + restMLen;
         stateOffs = nstateOffs + restOffs;
     }
-    ZSTDGPU_ASSERT(bitBuffer.hadlastrefill && bitBuffer.bitcnt == 0);
+    // ZSTDGPU_ASSERT(bitBuffer.hadlastrefill && bitBuffer.bitcnt == 0);
 #undef ZSTDGPU_BACKWARD_BITBUF
 }
```
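For context, the sequences bitstream that a backward bit buffer walks is read back-to-front, starting from a sentinel 1 bit in the last byte (as in CPU zstd's FSE decoding). Below is a minimal CPU-side sketch of the concept; the struct and function names are hypothetical, not the actual zstdgpu_Backward_BitBuffer API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Concept sketch of a backward bit reader.  The encoder writes bits
 * LSB-first and terminates the stream with a sentinel 1 bit in the last
 * byte; the decoder starts at that sentinel and consumes bits from high
 * to low, so values come out in reverse order of writing. */
typedef struct {
    const uint8_t *base; /* start of the bitstream segment */
    size_t byte_pos;     /* next byte to load; refills move toward base */
    uint64_t bits;       /* container; low `bitcnt` bits are valid */
    int bitcnt;
} BackwardBits;

static void bb_init(BackwardBits *b, const uint8_t *buf, size_t len)
{
    size_t i = len;
    while (i > 0 && buf[i - 1] == 0)
        i--; /* a valid stream ends with a nonzero byte */
    assert(i > 0);
    int high = 7;
    while (!((buf[i - 1] >> high) & 1))
        high--; /* bit position of the sentinel */
    b->base = buf;
    b->byte_pos = i - 1;
    b->bits = buf[i - 1] & ((1u << high) - 1u); /* payload below the sentinel */
    b->bitcnt = high;
}

static void bb_refill(BackwardBits *b)
{
    while (b->bitcnt <= 56 && b->byte_pos > 0) {
        b->byte_pos--;
        b->bits = (b->bits << 8) | b->base[b->byte_pos];
        b->bitcnt += 8;
    }
}

/* Read n (<= 24) bits from the top of the valid window. */
static uint32_t bb_read(BackwardBits *b, int n)
{
    bb_refill(b);
    assert(n <= b->bitcnt);
    b->bitcnt -= n;
    return (uint32_t)((b->bits >> b->bitcnt) & ((1u << n) - 1u));
}
```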
…NA." This reverts commit e1fe0f9. With the `if (threadId != 0) { return; }` change, the performance difference between a TG size of 32 (always wave32) and a TG size of 64 (probably wave64 on Navi3, and more likely on Navi4) is small. On the ~146MB insects archive on Navi31, a TG size of 32 is maybe still slightly faster (by 5-10 us with SetStablePowerState), but for other data sets (maybe ones with bigger FSE tables and fewer sequences) a TG size of 64 could fare better.
Main changes in this PR:
- `if (threadId != 0) return` after the `ZSTDGPU_PRELOAD_FSE_INTO_LDS` and `GroupMemoryBarrierWithGroupSync` is done, since the rest of the shader is all scalar (what the current compiler does is a different story). This greatly helps RDNA3 and RDNA4 because they often end up picking wave64 for compute shaders. The change turns the high 32 bits of exec off, so the typical second pass of a wave64 VALU/VMEM/LDS instruction is skipped. An input zst file of ~146MB ("insects") does a Dispatch(15299, 1, 1) that on a 7900 XTX (RDNA3) with SetStablePowerState has a duration change of 1.81 ms -> 1.50 ms. Using DecompressSequences_LdsFseCache32 seems to reduce that further (~1.44 ms), but with the other changes in this PR DecompressSequences_LdsFseCache{32,64} are pretty close on Navi31, and a TG size of 64 could be better for other data sets, so the PSO variant selection is kept the same for now.
- `ZSTGPU_DECODE_SEQ` is invoked just once. This is mainly for easier performance analysis; the runtime performance does not seem affected.
- Merged the `SEQ_{LITERAL, MATCH}_LENGTH_BASELINES` and `SEQ_{LITERAL, MATCH}_LENGTH_EXTRA_BITS` arrays into one array. There was a separate s_load_b32 with its own waitcnt, and it also reduces duplicate indexing and OOB-handling code the compiler emits. Navi31 XTX SetStablePowerState duration for the ~146MB "insects" archive: 1.44 ms -> 1.35 ms. See the commit in the PR for the before/after ISA.
- Use ByteAddressBuffer instead of StructuredBuffer, matching `HuffmanStream`, and because ByteAddressBuffer is more friendly to SMEM (SMEM doesn't do the stride-mul itself), though that last point may only be relevant after future compiler changes.

Diff is best viewed when ignoring whitespace changes.
This PR should probably be squashed if/when merged; I didn't force-push anything that would change commit hashes.