Implement per column compression#3396

Open
rahil-c wants to merge 1 commit into apache:master from rahil-c:rahil/per-column-compression

Conversation

@rahil-c rahil-c commented Feb 16, 2026

Rationale for this change

Issue Raised here: apache/parquet-format#553

The Parquet spec already supports per-column compression: each column chunk stores its own CompressionCodecName in the footer metadata. However, the parquet-java writer API currently forces a single compression codec for all columns in a file. This PR addresses that gap by exposing per-column compression configuration through the existing ColumnProperty infrastructure.

What changes are included in this PR?

  • ParquetProperties: Added a ColumnProperty following the same pattern used for dictionary encoding and bloom filters.
  • ColumnChunkPageWriteStore: Added a new constructor that accepts a CompressionCodecFactory + ParquetProperties.
  • InternalParquetRecordWriter: Added a new constructor accepting CompressionCodecFactory instead of a single BytesInputCompressor.
  • ParquetWriter: Added a withCompressionCodec(String, CompressionCodecName) builder method. Updated the core constructor to pass the CompressionCodecFactory through to the writer stack.
  • ParquetOutputFormat: Added ColumnConfigParser entry so per-column compression can be configured via Hadoop config keys (parquet.compression#=CODEC).
  • ParquetRecordWriter: Updated to pass CompressionCodecFactory to InternalParquetRecordWriter.
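The per-column resolution behind these changes can be sketched with stand-in types. This is a minimal illustration of the ColumnProperty lookup pattern described above (a per-column override map with a file-wide default), not the actual parquet-java classes:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the ColumnProperty-based lookup: per-column overrides
// keyed by dotted column path, falling back to a file-wide default,
// mirroring how dictionary encoding and bloom filters are resolved.
class PerColumnCodecs {
    private final String defaultCodec;
    private final Map<String, String> overrides = new HashMap<>();

    PerColumnCodecs(String defaultCodec) {
        this.defaultCodec = defaultCodec;
    }

    // Register a per-column override (builder-style).
    PerColumnCodecs withCompressionCodec(String columnPath, String codec) {
        overrides.put(columnPath, codec);
        return this;
    }

    // Resolution: the override wins if present, else the default applies.
    String getCompressionCodec(String columnPath) {
        return overrides.getOrDefault(columnPath, defaultCodec);
    }
}
```

In the PR, ColumnChunkPageWriteStore would perform the equivalent of getCompressionCodec(path) when creating each column's compressor.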

Are these changes tested?

  • Added a test within this PR.

Are there any user-facing changes?

Two new public APIs are introduced:

  ParquetWriter.builder(path)
      .withCompressionCodec(CompressionCodecName.SNAPPY)                      // default for all columns
      .withCompressionCodec("embeddings", CompressionCodecName.UNCOMPRESSED)  // per-column override
      .build();

  Hadoop configuration (new key pattern):
  parquet.compression#<column_path>=<CODEC_NAME>
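For example, a Hadoop configuration combining the existing file-wide key with the new per-column pattern might look like this ("embeddings" is a hypothetical column name):

```
# file-wide default codec (existing key)
parquet.compression=SNAPPY
# per-column override (new key pattern from this PR)
parquet.compression#embeddings=UNCOMPRESSED
```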

cc @julienledem @emkornfield

@rahil-c rahil-c marked this pull request as ready for review February 17, 2026 02:44
@rahil-c rahil-c changed the title [draft] Implement per column compression Implement per column compression Feb 17, 2026
* @param codecName the compression codec to use by default
* @return this builder for method chaining.
*/
public Builder withCompressionCodec(CompressionCodecName codecName) {
Contributor

There isn't an existing API to set this? I have to look more closely at the convention, but would withDefaultCompressionCodec make sense?

BytesInputCompressor compressor = codecFactory.getCompressor(props.getCompressionCodec(path));
writers.put(
path,
new ColumnChunkPageWriter(
Contributor

Is this copied and pasted from the other constructors? I wonder if there is some refactoring that can be done to avoid the duplication (perhaps a ColumnChunkPageWriterBuilder?).
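The builder the reviewer suggests could centralize the shared constructor wiring so each call site only sets what differs. A simplified sketch with stand-in types (PageWriterBuilder and its fields are illustrative, not existing parquet-java API):

```java
// Stand-in sketch of a ColumnChunkPageWriterBuilder: shared arguments
// are collected once, and build() does the duplicated construction.
class PageWriterBuilder {
    private String columnPath = "";
    private String compressor = "UNCOMPRESSED";

    PageWriterBuilder withColumnPath(String columnPath) {
        this.columnPath = columnPath;
        return this;
    }

    PageWriterBuilder withCompressor(String compressor) {
        this.compressor = compressor;
        return this;
    }

    // Real code would construct a ColumnChunkPageWriter here; a String
    // stands in so the sketch stays self-contained.
    String build() {
        return columnPath + ":" + compressor;
    }
}
```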

fileEncryptor,
rowGroupOrdinal);
ColumnChunkPageWriteStore columnChunkPageWriteStore;
if (codecFactory != null) {
Contributor

Is it possible to create a default codecFactory to avoid the if/else block below?
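The reviewer's idea here is the usual null-object move: resolve a default factory once instead of branching on null at every use site. A self-contained sketch (Function<String,String> stands in for the real codec factory type):

```java
import java.util.function.Function;

class WriterSetup {
    // Fall back once to a default factory so downstream code never
    // needs an if (codecFactory != null) branch. The default here maps
    // every codec name to "UNCOMPRESSED" purely for illustration.
    static Function<String, String> resolveFactory(Function<String, String> codecFactory) {
        return codecFactory != null ? codecFactory : name -> "UNCOMPRESSED";
    }
}
```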

}

@Test
public void testPerColumnCompression() throws Exception {
Contributor

Does this test anything that the next method does not?

* @param validating if schema validation should be turned on
* @param props parquet encoding properties
*/
ParquetRecordWriter(
Contributor

@emkornfield emkornfield Feb 22, 2026

It seems like this should be deprecated, and a new method without the codec passed in should be exposed instead?

this.codecFactory = new CodecFactory(conf, props.getPageSizeThreshold());
// Ensure the default compression codec from ParquetOutputFormat is set in props
ParquetProperties propsWithCodec =
ParquetProperties.copy(props).withCompressionCodec(codec).build();
Contributor

Does this risk overwriting an already-set compression codec?
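One way to avoid the clobbering the reviewer is worried about is to apply the ParquetOutputFormat default only when no codec was explicitly configured. A sketch with illustrative names (effectiveCodec is not existing parquet-java API):

```java
import java.util.Optional;

class CodecMerge {
    // Keep an explicitly configured codec if present; otherwise fall
    // back to the format-level default instead of overwriting.
    static String effectiveCodec(Optional<String> explicitlySet, String formatDefault) {
        return explicitlySet.orElse(formatDefault);
    }
}
```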

Contributor

@emkornfield emkornfield left a comment

Took a first pass. I'm not very familiar with this code, but I'd also expect potentially more test coverage given the number of classes changed?
