The Apache CarbonData community is pleased to announce the release of
Version 1.5.0 under The Apache Software Foundation (ASF).
CarbonData is a high-performance data solution that supports various data
analytic scenarios, including BI analysis, ad-hoc SQL queries, fast filter
lookups on detail records, streaming analytics, and so on. CarbonData has
been deployed in many enterprise production environments; in one of the
largest deployments, it supports queries on a single table with 3PB of data
(more than 5 trillion records) with a response time of less than 3 seconds!
This release note provides information on the new features, improvements,
and bug fixes of this release.
What’s New in CarbonData Version 1.5.0?
The intention of CarbonData 1.5.0 was to move closer to unified analytics.
We want to enable CarbonData files to be read from more engines and
libraries to support various use cases. In this regard, we have added
support to read CarbonData files from C++ libraries. Additionally,
CarbonData files can be read using the Java SDK, the Spark FileFormat
interface, Spark, and Presto.
CarbonData added multiple optimizations to reduce the store size so that
queries can take advantage of reduced IO. Several enhancements have also
been made to CarbonData's streaming support.
In this version of CarbonData, more than 150 JIRA tickets related to new
features, improvements, and bugs have been resolved. Following are the
highlights of this release.
Ecosystem Integration
Support Spark 2.3.2 ecosystem integration
CarbonData now supports Spark 2.3.2.
Spark 2.3.2 brings many performance improvements and critical bug fixes,
including improvements related to streaming and the unification of
interfaces. In version 1.5.0, CarbonData integrated with Spark 2.3.2 so that
future versions of CarbonData can add enhancements based on Spark's new and
improved capabilities.
Support Hadoop 3.1.1 ecosystem integration
CarbonData now supports Hadoop 3.1.1, the latest stable Hadoop version,
which brings many new features (erasure coding, federated clusters, etc.).
Lightweight Integration with Spark
CarbonData now supports the Spark FileFormat data source APIs so that
CarbonData can be integrated into Spark as an external file source. This
integration makes it possible to query CarbonData tables from a plain
SparkSession, and it helps applications that need standards compliance with
respect to the Spark data source APIs to perform file-format-level
operations such as read and write. Note that CarbonData's enhanced features,
namely IUD, Alter, Compaction, Segment Management, and Streaming, are not
available when CarbonData is integrated as a Spark data source through the
data source API.
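The following sketch shows what this integration looks like in practice. It
assumes "carbon" is the data source short name registered by the
CarbonData-Spark integration; verify the name against your CarbonData build.

```scala
import org.apache.spark.sql.SparkSession

object CarbonFileFormatExample {
  def main(args: Array[String]): Unit = {
    // A plain SparkSession is enough; no CarbonSession is required
    // for the file-format-level integration.
    val spark = SparkSession.builder()
      .appName("CarbonFileFormatExample")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    // Write CarbonData files through the standard data source API.
    df.write.format("carbon").save("/tmp/carbon_ds_example")

    // Read them back the same way.
    spark.read.format("carbon").load("/tmp/carbon_ds_example").show()
  }
}
```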
CarbonData Core
Adaptive Encoding for Numeric Columns
CarbonData now supports adaptive encoding for numeric columns. Adaptive
encoding stores each value of a column as a delta from the Min/Max value of
that column, thereby reducing the effective bits required to store the
value. This results in a smaller store size and, in turn, better query
performance due to reduced IO. Adaptive encoding has been supported for
dictionary columns since version 1.1.0; it is now extended to all numeric
columns. Performance improvement measurement was not complete for 1.5.0; the
results will be published along with the 1.5.1 release.
Configurable Column Size for Generating Min/Max
CarbonData generates a Min/Max index for all columns and uses it for
effective pruning of data while querying. Generating Min/Max for wide
columns (like an address column) leads to increased storage size and memory
footprint, thereby reducing query performance. Moreover, filters are usually
not applied on such columns, so there is no need to generate these indexes;
and when filters on such columns are rare, it is wiser to accept lower query
performance in those scenarios than to degrade the overall performance of
other filter scenarios due to the increased index size. CarbonData now
supports configuring the limit of the column width (in terms of characters)
beyond which Min/Max generation is skipped.
By default, the Min/Max index is generated for all string columns. Users who
are aware of their data schema and know which columns hold longer values and
will not be filtered upon can configure CarbonData to exclude such columns.
Alternatively, the maximum character length up to which Min/Max is generated
can be specified, so that CarbonData skips Min/Max index generation whenever
a column's character length crosses this configured threshold. By default,
string columns with more than 200 bytes are skipped from Min/Max index
generation; since each character occupies 2 bytes in Java, columns longer
than 100 characters are skipped from Min/Max index generation.
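A minimal sketch of tuning this threshold, assuming the property name
carbon.minmax.allowed.byte.count taken from the description above (check it
against the configuration reference for this release):

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Skip Min/Max index generation for string columns wider than 100 bytes
// (half the default threshold of 200 bytes).
CarbonProperties.getInstance()
  .addProperty("carbon.minmax.allowed.byte.count", "100")
```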
Support for Map Complex Data Type
CarbonData has integrated map complex data type support. Map data schemas
defined in Avro can be stored into CarbonData tables. Map data types allow
efficient lookup of data. With Map complex data type support, users can
store their Avro data directly, without writing logic to convert it into
CarbonData-supported data types.
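A minimal sketch of writing an Avro record containing a map field straight
into CarbonData via the SDK; the builder method names (outputPath,
withAvroInput, build) follow the SDK guide of this era and should be
verified against your SDK version:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.carbondata.sdk.file.CarbonWriter

// Avro schema with a map<string, string> field.
val avroSchema = new Schema.Parser().parse(
  """{"type": "record", "name": "User", "fields": [
    |  {"name": "id", "type": "int"},
    |  {"name": "props", "type": {"type": "map", "values": "string"}}
    |]}""".stripMargin)

val record = new GenericData.Record(avroSchema)
record.put("id", 1)
val props = new java.util.HashMap[String, String]()
props.put("city", "Shenzhen")
record.put("props", props)

// Write the Avro record directly; no manual type conversion is needed.
val writer = CarbonWriter.builder()
  .outputPath("/tmp/carbon_avro_map")
  .withAvroInput(avroSchema)
  .build()
writer.write(record)
writer.close()
```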
Support for Byte and Float Data Types
CarbonData supports the Byte and Float data types so that the data types
defined in Avro schemas can be stored into CarbonData tables. Columns of the
Byte data type can be included in sort columns.
Support for ZSTD Compression
ZSTD compression is supported to compress each page of a CarbonData file.
ZSTD offers a better compression ratio, thereby reducing the store size; on
average, ZSTD compression reduces the store size by 20-30%. ZSTD compression
is also supported to compress the sort temp files written during data
loading.
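A minimal sketch of switching the compressor to ZSTD, assuming the system
property carbon.column.compressor described in the configuration
documentation for this release:

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Use zstd instead of the default snappy for newly written pages.
CarbonProperties.getInstance()
  .addProperty("carbon.column.compressor", "zstd")
```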
CarbonData SDK
SDK Supports C++ Interfaces to read CarbonData files
To enable integration with non-Java execution engines, CarbonData provides a
C++ reader for CarbonData files. These readers can be integrated with any
execution engine to query data stored in CarbonData tables without any
dependency on Spark or Hadoop.
Multi-Thread Safe Writer API in SDK
To improve write performance when using the SDK, CarbonData supports
multi-thread safe writer APIs. These enable applications to write data to a
single CarbonData file in parallel. Multi-thread safe writers help in
generating bigger CarbonData files, thereby avoiding the small-files problem
faced in HDFS.
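A sketch of parallel writes through one writer, assuming the builder methods
withThreadSafe and withCsvInput from the SDK guide of this era (exact names
may differ across SDK versions):

```scala
import org.apache.carbondata.core.metadata.datatype.DataTypes
import org.apache.carbondata.sdk.file.{CarbonWriter, Field, Schema}

val schema = new Schema(Array(
  new Field("id", DataTypes.INT),
  new Field("name", DataTypes.STRING)))

// Declare the writer safe for 4 concurrent threads up front.
val writer = CarbonWriter.builder()
  .outputPath("/tmp/carbon_sdk_parallel")
  .withThreadSafe(4.toShort)
  .withCsvInput(schema)
  .build()

// All threads feed the same writer, producing one set of larger files
// instead of many small ones.
val threads = (1 to 4).map { t =>
  new Thread(() => {
    for (i <- 1 to 1000) {
      writer.write(Array(s"${t * 100000 + i}", s"row-$t-$i"))
    }
  })
}
threads.foreach(_.start())
threads.foreach(_.join())
writer.close()
```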
Streaming
StreamSQL supports Kafka as streaming source
StreamSQL DDL now supports specifying Kafka as a streaming source. With this
support, users need not write a custom application to ingest streaming data
from Kafka into CarbonData; they can simply specify 'format' as 'kafka' in
the CREATE TABLE DDL.
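An illustrative sketch of the end-to-end StreamSQL flow; the property keys
('streaming', 'format', 'kafka.bootstrap.servers', 'subscribe') and the
CREATE STREAM syntax follow the streaming guide for this release and should
be verified against your deployment (spark is an initialized
CarbonData-enabled SparkSession):

```scala
// Kafka-backed streaming source table: no custom ingestion code needed.
spark.sql(
  """CREATE TABLE kafka_source (id INT, name STRING)
    |STORED AS carbondata
    |TBLPROPERTIES (
    |  'streaming'='source',
    |  'format'='kafka',
    |  'kafka.bootstrap.servers'='localhost:9092',
    |  'subscribe'='events')""".stripMargin)

// Streaming sink table that accumulates the ingested rows.
spark.sql(
  """CREATE TABLE carbon_sink (id INT, name STRING)
    |STORED AS carbondata
    |TBLPROPERTIES ('streaming'='sink')""".stripMargin)

// Start the continuous ingestion job from source to sink.
spark.sql(
  """CREATE STREAM ingest ON TABLE carbon_sink
    |STMPROPERTIES ('trigger'='ProcessingTime', 'interval'='5 seconds')
    |AS SELECT * FROM kafka_source""".stripMargin)
```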
StreamSQL supports JSON records from Kafka/socket streaming sources
StreamSQL can now accept JSON as the data format in addition to CSV. This
saves users from writing custom applications to ingest streaming data in
JSON format.
Min/Max Index Support for Streaming Segment
CarbonData now supports generating Min/Max indexes for streaming segments so
that filter pruning is more efficient and query performance improves.
CarbonData is able to serve queries faster due to the Min/Max indexes built
at various levels; adding Min/Max index support to streaming segments
enables CarbonData to serve queries on them with the same performance as on
other segments.
Debugging and Maintenance Enhancements
Data Summary Tool
CarbonData provides a CLI tool to retrieve statistical information from each
CarbonData file. It can list various parameters such as the number of
blocklets, pages, encoding types, and Min/Max indexes. This tool is useful
to identify the reason for a block/blocklet selection during pruning.
Looking at the Min/Max indexes, a user can easily decide the blocklet size
so as to avoid false positives. The tool also supports scan performance
benchmarking: users can use it to identify the time taken to scan each
blocklet for a particular column.
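A sketch of invoking the tool programmatically, assuming the CarbonCli entry
point and the -cmd/-p/-a/-c flags from the tool's usage text (the same
arguments can be passed on the command line):

```scala
import org.apache.carbondata.tool.CarbonCli

// Summarize blocklets, pages, encodings, and Min/Max indexes for the
// CarbonData files under a table path.
CarbonCli.main(Array("-cmd", "summary", "-p", "/tmp/carbon_table", "-a"))

// Benchmark the scan time per blocklet for one column.
CarbonCli.main(Array("-cmd", "benchmark", "-p", "/tmp/carbon_table", "-c", "name"))
```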
Other Improvements
- Code has been optimized to avoid unnecessary listing of CarbonData files
stored in S3, resulting in an S3 performance enhancement.
- The SDK now supports Varchar columns longer than 32K characters.
- The sort_scope can now be decided during the CarbonData write operation
(see the sketch after this list).
- The memory footprint of data loading with Local Dictionary has been
optimized to consume approximately 2x that of data loading with Global
Dictionary; in earlier versions, the memory footprint was about 10x.
- SDK APIs have been simplified for easy accommodation of new input types
(for example, CSV, JSON, and so on) without modifying much of the business
logic.
- Bloom Filter quality has been further enhanced by fixing various bugs
related to bloom index creation and clean-up. Bloom filter scans for IN
expressions have been optimized to scan only once.
- MV datamap quality has been enhanced by fixing numerous bugs related to
the MV selection logic and by supporting various SQL constructs. Examples
have been added to explain the usage of MV.
- A compaction bug that caused subsequent segments to be skipped from
compaction when the configuration was (X,1) has been fixed.
- The SHOW SEGMENT command now displays the size of each segment. This helps
the user to perform maintenance operations like compaction and backup.
- The SDK has been enhanced to support long_string_columns and the Map
complex data type.
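As referenced in the sort_scope item above, here is a sketch of choosing the
sort scope per load instead of fixing it at table creation; the LOAD DATA
option name follows the DML documentation and should be checked against this
release (spark is an initialized CarbonData-enabled SparkSession):

```scala
// Sort columns are declared once on the table...
spark.sql(
  """CREATE TABLE sales (id INT, country STRING)
    |STORED AS carbondata
    |TBLPROPERTIES ('SORT_COLUMNS'='country')""".stripMargin)

// ...but the sort scope can now be picked per write operation.
spark.sql(
  """LOAD DATA INPATH '/tmp/sales.csv' INTO TABLE sales
    |OPTIONS ('sort_scope'='local_sort')""".stripMargin)
```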
Behavioral Changes
Renaming of Table Names
Earlier, renaming a CarbonData table renamed it in the Hive metastore and
also renamed the table folder on HDFS. Now, the table is renamed only in the
Hive metastore; the folder name on HDFS is left unchanged.