The Apache CarbonData community is pleased to announce the release of
version 1.5.1 under The Apache Software Foundation (ASF).
CarbonData is a high-performance data solution that supports various data
analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter
lookup on detail records, streaming analytics, and so on. CarbonData has
been deployed in many enterprise production environments; in one of the
largest deployments it supports queries on a single table with 3 PB of data
(more than 5 trillion records) with a response time of less than 3 seconds!
This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.5.1?
CarbonData 1.5.1 moves closer to unified analytics. We want CarbonData
files to be readable from more engines and libraries to support various
use cases. In this regard, we have added support for writing CarbonData
files from C++ libraries.
CarbonData added multiple optimizations to improve query and compaction performance.
In this version of CarbonData, more than 78 JIRA tickets related to new
features, improvements, and bugs have been resolved. Following are the
highlights.

CarbonData Core

Support Custom Column Compressor
CarbonData supports customized column compressors so that users can plug in
their own compressor implementation. To customize the compressor, users can
directly use its full class name while creating a table or when setting the
compressor property.
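As an illustration of the idea (not CarbonData's actual compressor interface or registration mechanism, which differ in detail), a pluggable column compressor can be sketched as a small interface plus a user implementation that would be referenced by its full class name:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical shape of a pluggable column compressor.
interface ColumnCompressor {
    byte[] compress(byte[] input) throws IOException;
    byte[] decompress(byte[] input) throws IOException;
}

// A custom implementation backed by GZIP; in CarbonData the user would
// reference the implementation's full class name when creating the table.
class GzipColumnCompressor implements ColumnCompressor {
    public byte[] compress(byte[] input) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(input);
        }
        return bos.toByteArray();
    }

    public byte[] decompress(byte[] input) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(input))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) > 0) bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }
}

public class CompressorDemo {
    public static void main(String[] args) throws IOException {
        ColumnCompressor c = new GzipColumnCompressor();
        byte[] data = "column page bytes".getBytes();
        byte[] roundTrip = c.decompress(c.compress(data));
        System.out.println(Arrays.equals(data, roundTrip)); // true
    }
}
```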
Performance Improvements

Optimized CarbonData Scan Performance
CarbonData scan performance is improved by avoiding multiple data copies in
the vector flow. This is achieved by short-circuiting the read and the
vector filling: data is filled directly into the vector after reading it
from the file, without any intermediate copies.
Row-level filter processing is now handled in the execution engine; only
blocklet and page pruning is handled in CarbonData for the vector flow.
This is controlled by the property *carbon.push.rowfilters.for.vector*.
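The short-circuited vector fill can be sketched as follows. The names here are hypothetical; the point is simply that decoded values move straight from the file buffer into the column vector, with no intermediate row copy:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch of "short-circuiting" read and vector fill:
// values are decoded from the page buffer directly into the column
// vector, skipping any intermediate row representation.
public class VectorFillDemo {
    // Fill the vector directly while decoding the page bytes.
    static void fillDirect(ByteBuffer page, int[] vector) {
        for (int i = 0; i < vector.length; i++) {
            vector[i] = page.getInt(); // no intermediate copy
        }
    }

    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.allocate(4 * 4);
        for (int v : new int[]{10, 20, 30, 40}) page.putInt(v);
        page.flip();

        int[] vector = new int[4];
        fillDirect(page, vector);
        System.out.println(Arrays.toString(vector)); // [10, 20, 30, 40]
    }
}
```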
Optimized Compaction Performance
Compaction performance is optimized by pre-fetching data while reading the
carbon files.
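The pre-fetching idea can be sketched with a simple producer/consumer pair, in which a reader thread fetches the next batch of rows while the compactor merges the current one. This mirrors the pattern only; it is not CarbonData's actual compaction code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PrefetchCompactionDemo {
    private static final List<Integer> POISON = new ArrayList<>();

    // Merge `batches` batches of `batchSize` rows each, with a reader
    // thread pre-fetching the next batch while the current one is merged.
    static int compact(int batches, int batchSize) throws InterruptedException {
        BlockingQueue<List<Integer>> prefetched = new ArrayBlockingQueue<>(2);

        Thread reader = new Thread(() -> {
            try {
                for (int b = 0; b < batches; b++) {
                    List<Integer> rows = new ArrayList<>();
                    for (int i = 0; i < batchSize; i++) rows.add(b * batchSize + i);
                    prefetched.put(rows); // read ahead while the consumer merges
                }
                prefetched.put(POISON); // end-of-data marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        int merged = 0;
        List<Integer> rows;
        while ((rows = prefetched.take()) != POISON) {
            merged += rows.size(); // stand-in for merging sorted rows
        }
        reader.join();
        return merged;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(compact(3, 4)); // 12
    }
}
```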
Improved Blocklet DataMap Pruning in Driver
Blocklet DataMap pruning is improved by using multi-threaded processing in
the driver.
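The pruning step can be sketched as below. The names are hypothetical, but the idea matches: each blocklet carries min/max metadata, and pruning keeps only blocklets whose range can contain the predicate value, evaluated across multiple threads:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustrative sketch of multi-threaded blocklet pruning in the driver.
public class ParallelPruningDemo {
    static class Blocklet {
        final int id, min, max;
        Blocklet(int id, int min, int max) { this.id = id; this.min = min; this.max = max; }
    }

    // Keep only blocklets whose [min, max] range can contain the value.
    static List<Integer> prune(List<Blocklet> blocklets, int value) {
        return blocklets.parallelStream() // prune across multiple threads
                .filter(b -> b.min <= value && value <= b.max)
                .map(b -> b.id)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Blocklet> blocklets = IntStream.range(0, 8)
                .mapToObj(i -> new Blocklet(i, i * 100, i * 100 + 99))
                .collect(Collectors.toList());
        // WHERE col = 250 falls only in blocklet 2's range [200, 299].
        System.out.println(prune(blocklets, 250)); // [2]
    }
}
```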
CarbonData SDK

SDK Supports C++ Interfaces for Writing CarbonData Files
To enable integration with non-Java execution engines, CarbonData provides
a C++ wrapper over JNI to write CarbonData files. It can be integrated with
any execution engine to write data to CarbonData files without a dependency
on Spark or Hadoop.
Multi-Thread Read API in SDK
To improve the read performance when using SDK, CarbonData supports
multi-thread read APIs. This enables the applications to read data from
multiple CarbonData files in parallel. It significantly improves the SDK
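The parallel-read pattern can be sketched with an executor pool, one task per file. The file names and `readFile` helper below are hypothetical; the CarbonData SDK's actual reader API differs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative multi-threaded read: one task per file, results gathered
// via futures.
public class ParallelReadDemo {
    // Stand-in for reading rows from one carbon file; returns a row count.
    static int readFile(String file) {
        return file.length();
    }

    static int readAll(List<String> files) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(files.size());
        List<Future<Integer>> futures = new ArrayList<>();
        for (String f : files) futures.add(pool.submit(() -> readFile(f)));

        int totalRows = 0;
        for (Future<Integer> fu : futures) totalRows += fu.get();
        pool.shutdown();
        return totalRows;
    }

    public static void main(String[] args) throws Exception {
        List<String> files = List.of("part-0.carbondata", "part-1.carbondata");
        System.out.println(readAll(files)); // 34
    }
}
```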
Other Improvements

- Enhanced the CLI by adding more options.
- Supported a fallback mechanism: when off-heap memory is insufficient,
switch to on-heap memory instead of failing the job.
- Supported a separate audit log.
- Supported reading rows in batches in the CSDK to improve performance.
- Enabled local dictionary by default.
- Disabled the inverted index by default.
- Sort temp files generated during data loading are now compressed with
Snappy by default to improve IO.
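The off-heap-to-on-heap fallback mentioned above can be sketched as follows. CarbonData's real unsafe memory manager is more involved; this only illustrates the fallback pattern:

```java
import java.nio.ByteBuffer;

// Illustrative fallback: try off-heap (direct) allocation first, and
// fall back to on-heap instead of failing the job.
public class OffHeapFallbackDemo {
    static ByteBuffer allocate(int bytes) {
        try {
            return ByteBuffer.allocateDirect(bytes); // off-heap first
        } catch (OutOfMemoryError oom) {
            return ByteBuffer.allocate(bytes);       // on-heap fallback
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = allocate(1024);
        System.out.println(buf.capacity()); // 1024
    }
}
```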
New Configuration Parameters
Parameter: *carbon.max.driver.threads.for.block.pruning*
Default Value: 4
Valid Range: 1-4