Conversation
| * @param codecName the compression codec to use by default | ||
| * @return this builder for method chaining. | ||
| */ | ||
| public Builder withCompressionCodec(CompressionCodecName codecName) { |
There was a problem hiding this comment.
There isn't an existing API to set this? I have to look more closely at the the convention but would withDefaultCompressionCodec make sense?
| BytesInputCompressor compressor = codecFactory.getCompressor(props.getCompressionCodec(path)); | ||
| writers.put( | ||
| path, | ||
| new ColumnChunkPageWriter( |
There was a problem hiding this comment.
Is this copy and paste from other constructors, I wonder if there is some refactoring that can be done to avoid duplication? (I wonder if we should have a ColumnChunkPageWriterBuilder?
| fileEncryptor, | ||
| rowGroupOrdinal); | ||
| ColumnChunkPageWriteStore columnChunkPageWriteStore; | ||
| if (codecFactory != null) { |
There was a problem hiding this comment.
Is it possible to create a default codecFactory to avoid the if/else block below?
| } | ||
|
|
||
| @Test | ||
| public void testPerColumnCompression() throws Exception { |
There was a problem hiding this comment.
does this test anything that the next method does not?
| * @param validating if schema validation should be turned on | ||
| * @param props parquet encoding properties | ||
| */ | ||
| ParquetRecordWriter( |
There was a problem hiding this comment.
It seems like this should be deprecated, and new method without codec passed in should be exposed instead?
| this.codecFactory = new CodecFactory(conf, props.getPageSizeThreshold()); | ||
| // Ensure the default compression codec from ParquetOutputFormat is set in props | ||
| ParquetProperties propsWithCodec = | ||
| ParquetProperties.copy(props).withCompressionCodec(codec).build(); |
There was a problem hiding this comment.
does risk overwriting an already set compression codec?
emkornfield
left a comment
There was a problem hiding this comment.
Took a first pass not very familiar with this code, but I'd also expect potentially more test coverage given the number of classes changed?
Rationale for this change
Issue Raised here: apache/parquet-format#553
The Parquet spec already supports per-column compression, each column chunk stores its own CompressionCodecName in the footer metadata. However, the parquet-java writer API currently forces a single compression codec for all columns in a file. This PR address that gap by exposing per-column compression configuration through the existing ColumnProperty infrastructure.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Two new public APIs are introduced:
cc @julienledem @emkornfield