Here I'd like to explain the modifications in 'Support Zstd as Column
Compressor' (PR2628). Please give your feedback if you have any concerns.
Zstd is a compressor that achieves a higher compression ratio than Snappy
while keeping similar compression/decompression speed (a little worse than
Snappy). This compressor has been used in other products in our company and
is regarded as a replacement for Snappy, trading an acceptable decrease in
decompression speed for a higher compression ratio.
So we want to introduce the Zstd compressor to compress the column values in
the final CarbonData file. (The last sentence is meant to distinguish this
from the compressor for sort temp files.)
1. The metadata of the compressor for a column is stored in DataChunk3.
CarbonData defines the compressor in thrift. Previously only Snappy was
supported, so I
1.1 add Zstd in the thrift.
1.2 add ZstdCompressor and update the CompressorFactory
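For illustration only, the thrift addition in 1.1 might look roughly like the
sketch below; the actual enum name, values, and ids live in CarbonData's
thrift definitions and may differ:

```thrift
// Illustrative sketch -- not the real field ids/names:
enum CompressionCodec {
  SNAPPY = 0;
  ZSTD = 1;   // 1.1: newly added codec
}
```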
2. For data loading, before the load starts, CarbonData will get the
compressor from the system property file and pass the compressor info to the
subsequent procedures, so that all the pages in all the blocklets in this
load will use the same compressor. This avoids problems if the property is
changed while loads are running concurrently.
For this modification, we will
2.1 add the compressor info to CarbonLoadModel
2.2 add the compressor as a member for ColumnPage
2.3 add the compressor as an input parameter when creating a ColumnPage
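A rough sketch of how the compressor travels from the load model into each
page (all names here are illustrative stand-ins, not CarbonData's actual
CarbonLoadModel/ColumnPage API):

```java
import java.util.function.UnaryOperator;

// Illustrative sketch: the codec is resolved once when the load starts and
// captured by each ColumnPage, so a concurrent change of the system
// property cannot mix codecs within a single load.
final class LoadModel {
    final UnaryOperator<byte[]> compressor;  // resolved once, at load start
    LoadModel(UnaryOperator<byte[]> compressor) { this.compressor = compressor; }
}

final class ColumnPage {
    private final UnaryOperator<byte[]> compressor;  // 2.2: member of the page
    private byte[] raw = new byte[0];

    // 2.3: the compressor is an input parameter when creating the page
    ColumnPage(LoadModel model) { this.compressor = model.compressor; }

    void put(byte[] bytes) { this.raw = bytes; }
    byte[] compress() { return compressor.apply(raw); }
}
```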
3. For data querying, CarbonData will get the compressor info from
DataChunk3 in the chunk, and then use that compressor to decompress the
content. This means that we will
3.1 get the compressor from the dimension/measure chunk during reading
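The read path can be sketched as follows (the reader and codec map here are
illustrative, not CarbonData's actual classes): the codec name comes from the
chunk's own metadata, never from the current system property, so previously
written data stays readable after the property changes.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustrative sketch of the query side: decompress with whatever codec
// the chunk metadata (DataChunk3 in the real format) recorded.
final class ChunkReader {
    static byte[] readColumn(String compressorNameFromChunk,
                             byte[] compressedData,
                             Map<String, UnaryOperator<byte[]>> codecs) {
        UnaryOperator<byte[]> decompressor = codecs.get(compressorNameFromChunk);
        if (decompressor == null) {
            throw new IllegalStateException(
                "Unknown compressor in file: " + compressorNameFromChunk);
        }
        return decompressor.apply(compressedData);
    }
}
```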
4. For other places that use a compressor, such as compressing the
configuration, we will use Snappy just like before. This means we will
4.1 explicitly specify Snappy as the compressor there
5. The legacy store uses Snappy, so we just
5.1 specify Snappy as the compressor while reading the legacy store.
6. For streaming segments, the (streaming) blocklets are also compressed.
Because files in streaming segments did not store the compressor info before, we will
6.1 add the compressor in the FileHeader in thrift file
6.2 during loading for a streaming segment, if the stream file already
exists, read the compressor info from the FileHeader of that file and reuse it
6.3 if the stream file does not exist, read the compressor info from the
system property and set it in the FileHeader
6.4 for the streaming legacy store, the FileHeader does not contain a
compressor; in this case, we will use Snappy to write & read the following
blocklets
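The selection rule in 6.2-6.4 can be sketched as follows (method and
parameter names are illustrative, not CarbonData's actual API):

```java
// Illustrative sketch of how the compressor is chosen for a stream file.
final class StreamCompressorChooser {
    static String choose(boolean fileExists,
                         String headerCompressor,    // from FileHeader; null for legacy files
                         String propertyCompressor)  // from the system property
    {
        if (fileExists) {
            // 6.2: reuse the codec recorded in the existing file's header;
            // 6.4: legacy streaming files carry none, so fall back to snappy
            return headerCompressor != null ? headerCompressor : "snappy";
        }
        // 6.3: new file -- take the configured codec (it is then persisted
        // in the FileHeader)
        return propertyCompressor;
    }
}
```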
7. For compaction and handoff, since they reuse the read procedure, no extra
modification has been made for this. We still
7.1 add test cases for it (please refer to the PR).
8. As for extending to other compressors, it's simple to add a new one. Take
LZ4 for example; the following changes are required:
8.1 Add LZ4 in thrift
8.2 Add Lz4Compressor
8.3 Add Lz4Compressor to the compressor factory
As a result of the latest implementation, I store the compressor name in the
thrift and the old enum for compression_codec has been deprecated. This
makes it easier to support other compressors. Take LZ4 for example, the
following changes are required:
1. Implement Lz4Compressor
2. Add Lz4Compressor to the compressor factory as a natively supported compressor
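Those two steps can be sketched as follows. The registry and interface here
are illustrative stand-ins for CarbonData's CompressorFactory, and the LZ4
codec body is stubbed as identity since lz4-java is not assumed to be
available:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative: with the compressor stored by name in the thrift metadata,
// adding a codec means implementing it and registering it -- no thrift
// change is needed.
interface Compressor {
    String getName();
    byte[] compressByte(byte[] input);
    byte[] unCompressByte(byte[] input);
}

final class CompressorRegistry {
    private static final Map<String, Compressor> REGISTRY = new ConcurrentHashMap<>();
    static void register(Compressor c) { REGISTRY.put(c.getName(), c); }
    static Compressor get(String name) { return REGISTRY.get(name); }
}

// Step 1: implement the codec (stubbed; a real Lz4Compressor would
// delegate to a library such as lz4-java)
final class Lz4Compressor implements Compressor {
    public String getName() { return "lz4"; }
    public byte[] compressByte(byte[] input) { return input; }
    public byte[] unCompressByte(byte[] input) { return input; }
}
```

Step 2 is then a single call at startup: CompressorRegistry.register(new Lz4Compressor());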
For the LZ4 task, I checked the LZ4 compressor (lz4-java) and found that it needs the decompressed size before decompressing the data. In the CarbonData V3 format, we have stored the uncompressed size of every data page in ChunkCompressionMeta.total_uncompressed_size in the data file.
So to implement Lz4Compressor in Carbon, I think we need to use this information from the file, and the compressor interface may need to be changed to add this parameter to the unCompressXXX methods so that LZ4 can use it.
> On Sep 12, 2018, at 8:35 PM, xuchuanyin <[hidden email]> wrote:
To work around this with LZ4, you can go with your proposal and save & use
the decompressed size in the meta.
But I'd prefer to wrap the LZ4 implementation by
1. adding the original size when we return the compressed content, and
2. extracting that original size when we want to decompress the content.
In this way, we can keep the API stable.
Snappy and Zstd both know the decompressed size of the content, since they store that size along with the compressed content. But LZ4 does not; you can refer to issue #26 on the lz4-java GitHub page.
To work around this, you can store the original size in the metadata for decompression. But I would prefer to store the size along with the compressed content, to keep the interface stable.
1. While compressing, return the compressed content prefixed with the original size.
2. While decompressing, extract the original size and use it to initialize the destination buffer.
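The wrapping idea can be sketched as follows. Since lz4-java may not be on
the classpath, this sketch uses java.util.zip's Deflater/Inflater as a
stand-in codec; the point is only the 4-byte length prefix, which is what an
Lz4Compressor would add so that the unCompressXXX side needs no extra size
parameter:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of the length-prefix wrapping proposed above. Deflater/Inflater
// stand in for LZ4; only the prefix handling is the point.
final class PrefixedCompressor {

    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // 1. prefix the compressed content with the original length
        out.write(ByteBuffer.allocate(4).putInt(input.length).array(), 0, 4);
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] input) {
        // 2. extract the original size from the prefix and size the
        // destination buffer exactly -- what LZ4 needs up front
        int originalSize = ByteBuffer.wrap(input, 0, 4).getInt();
        byte[] dest = new byte[originalSize];
        Inflater inflater = new Inflater();
        inflater.setInput(input, 4, input.length - 4);
        int written = 0;
        try {
            while (written < originalSize && !inflater.finished()) {
                written += inflater.inflate(dest, written, originalSize - written);
            }
        } catch (DataFormatException e) {
            throw new RuntimeException("corrupt compressed input", e);
        } finally {
            inflater.end();
        }
        return dest;
    }
}
```

With this shape, the compressed byte[] is self-describing, so the existing
decompress-without-size interface can stay as it is.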