The Apache CarbonData community is pleased to announce the release of
Version 1.5.2 in The Apache Software Foundation (ASF).
CarbonData is a high-performance data solution that supports various data
analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookup on detail record, streaming analytics, and so on. CarbonData has
been deployed in many enterprise production environments. In one of the
largest scenarios, it supports queries on a single table with 3PB of data
(more than 5 trillion records) with a response time of less than 3 seconds!
This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.5.2?
The intention of CarbonData 1.5.2 was to move closer to unified analytics.
We want to enable CarbonData files to be read from more engines/libraries
to support various use cases. In this regard, we have enhanced and
stabilized the Presto features and added the following features and
improvements.
In this version of CarbonData, more than 68 JIRA tickets related to new
features, improvements, and bugs have been resolved. Following are the
CarbonData Core
Support Compaction for No-sort Load Segments
During data loading, if the sort scope is set to no-sort, loading
performance increases significantly because the data is not sorted and is
written as it is received. However, no-sort loading degrades query
performance because indexes are not built on these segments. Compacting
these no-sort loaded segments converts them into sorted segments and
thereby improves query performance as indexes get generated. This feature
is best suited to scenarios where high-speed data loading is more important
than high query performance until the compaction is done.
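The trade-off described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not CarbonData's implementation: segments are plain Python lists, and the helper names `compact`, `scan_lookup` and `sorted_lookup` are hypothetical.

```python
from bisect import bisect_left


def compact(segments):
    """Merge several unsorted segments into one sorted segment."""
    merged = [row for segment in segments for row in segment]
    merged.sort()
    return merged


def scan_lookup(segments, key):
    """No-sort segments: every row of every segment may have to be checked."""
    return any(row == key for segment in segments for row in segment)


def sorted_lookup(segment, key):
    """Sorted segment: binary search locates the key without a full scan."""
    i = bisect_left(segment, key)
    return i < len(segment) and segment[i] == key


# Two segments loaded with no-sort: rows are kept in arrival order.
no_sort_segments = [[42, 7, 19], [3, 88, 7]]

# After compaction the rows form a single sorted segment.
compacted = compact(no_sort_segments)
print(compacted)  # [3, 7, 7, 19, 42, 88]
```

Loading stays cheap because nothing is sorted at write time; the sorting cost is paid later, once, during compaction.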
Support Rename of Column Names
Column names can be renamed to reflect the business scenario or
Support GZIP Compressor for CarbonData Files
GZIP compression can now be used to compress each page of a CarbonData
file. GZIP offers a better compression ratio, thereby reducing the store
size. On average, GZIP compression reduces store size by 20-30% compared to
Snappy compression. GZIP compression is also supported for the sort temp
files written during data loading. GZIP can also be accelerated by
hardware, so data loading performance would increase on machines where GZIP
is supported natively by the hardware.
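To make the idea of page-level compression concrete, here is a minimal sketch using Python's standard `gzip` module. It is not CarbonData's writer: the page size, the newline-joined page layout and the function names are assumptions made only for this illustration, and the sizes it prints say nothing about the 20-30% figure quoted above.

```python
import gzip

PAGE_SIZE = 32000  # rows per page; an assumed value for this sketch


def compress_pages(values, page_size=PAGE_SIZE):
    """Split a column into pages and GZIP-compress each page independently."""
    pages = []
    for start in range(0, len(values), page_size):
        raw = "\n".join(values[start:start + page_size]).encode("utf-8")
        pages.append(gzip.compress(raw))
    return pages


def decompress_pages(pages):
    """Decompress every page and restore the original column values."""
    values = []
    for page in pages:
        values.extend(gzip.decompress(page).decode("utf-8").split("\n"))
    return values


# A low-cardinality string column, which compresses well.
column = [f"city_{i % 50}" for i in range(100_000)]
pages = compress_pages(column)

raw_size = sum(len(v) + 1 for v in column)   # bytes before compression
gzip_size = sum(len(p) for p in pages)       # bytes after compression
print(len(pages), raw_size, gzip_size)
```

Because each page is compressed on its own, a reader only needs to decompress the pages it actually touches.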
Performance Improvements
Support Range Partitioned Sort during data load
Global sort during data loading ensures that the data is entirely sorted
and hence groups all identical values on a particular node/machine. This
helps to optimise Spark scan performance and also increases concurrency.
The drawback of global sort is that it is very slow, as the data has to be
globally sorted (heavy shuffle). Local sort, on the other hand, partitions
the data across multiple nodes/machines and ensures that the data local to
each node/machine is sorted. This improves data loading performance, but
query performance degrades a bit as more Spark tasks have to be launched to
scan the data. Range sort splits the data based on value ranges and loads
each range using local sort. This gives balanced performance for both load
and query.
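The difference between local sort and range sort can be sketched with a toy model. This is an illustration under stated assumptions rather than CarbonData's loader: partitions are plain lists, the range boundaries are hand-picked instead of being derived from the data, and the function names are hypothetical.

```python
from bisect import bisect_right


def local_sort(rows, num_partitions):
    """Distribute rows round-robin, then sort each partition on its own."""
    parts = [rows[i::num_partitions] for i in range(num_partitions)]
    return [sorted(p) for p in parts]


def range_sort(rows, boundaries):
    """Route each row to a partition by value range, then sort it locally."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row)].append(row)
    return [sorted(p) for p in parts]


def partitions_to_scan(parts, low, high):
    """Count partitions whose [min, max] interval overlaps the filter range."""
    return sum(1 for p in parts if p and p[0] <= high and p[-1] >= low)


rows = [55, 3, 91, 17, 64, 28, 8, 76, 42, 99, 12, 33]
local = local_sort(rows, 3)
ranged = range_sort(rows, boundaries=[30, 60])

# A filter on values between 35 and 45 overlaps every locally sorted
# partition, but only one range-sorted partition.
print(partitions_to_scan(local, 35, 45))   # 3
print(partitions_to_scan(ranged, 35, 45))  # 1
```

Neither variant needs a full global sort: each partition is sorted independently, and range sort only adds the routing step that keeps value ranges from overlapping.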
Other Improvements
Presto Enhancements
CarbonData implemented features to better integrate with Presto. Now Presto
can recognise CarbonData as a native format. Many bugs were fixed to
enhance the stability.
Support Map Data Type through DDL
Version 1.5.0 supported adding the Map data type through the CarbonData
SDK. This version supports adding the Map data type through DDL.
Behavior Changes
1. If the user doesn’t specify sort columns during table creation, the
default sort scope is set to no-sort during data loading.
2. The default complex value delimiters are changed from '$', ':' to
'\001', '\002' respectively.
3. Inverted index generation is disabled by default.
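As an illustration of the delimiter change in item 2, the sketch below splits a raw complex value with the new and the old defaults. It assumes the first delimiter separates top-level elements and the second separates nested ones; the names `LEVEL_1`, `LEVEL_2` and `parse_complex` are hypothetical and this is not CarbonData's parser.

```python
LEVEL_1 = "\001"  # new default for top-level elements (previously '$')
LEVEL_2 = "\002"  # new default for nested elements (previously ':')


def parse_complex(value, level_1=LEVEL_1, level_2=LEVEL_2):
    """Split a raw complex value into a list of lists of strings."""
    return [element.split(level_2) for element in value.split(level_1)]


# Two entries, each with two fields, encoded with the new defaults.
raw_new = LEVEL_1.join([
    LEVEL_2.join(["alice", "30"]),
    LEVEL_2.join(["bob", "25"]),
])
print(parse_complex(raw_new))  # [['alice', '30'], ['bob', '25']]

# The same value encoded with the old defaults needs them passed explicitly.
raw_old = "alice:30$bob:25"
print(parse_complex(raw_old, level_1="$", level_2=":"))
```

Input files that still rely on '$' and ':' would therefore need the delimiters specified explicitly rather than left to the defaults.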
New Configuration Parameters
Configuration Name                 Default Value   Range
carbon.table.load.sort.scope       LOCAL_SORT      LOCAL_SORT, NO_SORT, GLOBAL_SORT, BATCH_SORT
carbon.range.column.scale.factor   3               1-300