Recently I compared full-scan performance between Parquet and CarbonData,
and found that CarbonData's full scan was slower than Parquet's.
1. Spark 2.2 + Parquet vs. Spark 2.2 + CarbonData (master branch);
2. Run in local mode;
3. There are 8 parquet files in one folder, total: 47474456 records, the size of each file is about 170 MB;
4. There are 8 segments in one carbondata table, total: 47474456 records, each segment has one file, the size of each file is about 220 MB, and there are 4 blocklets and 186 pages in one carbondata file;
5. The data of each parquet file and carbondata file is the same;
6. create table sql:
CREATE TABLE IF NOT EXISTS cll_carbon_small (
  ts int,
  fratio int,
  code int,
  ...
) STORED BY 'carbondata'
7. test sql:
1) select count(chan), count(fcip), sum(size) from table;
2) select chan, fcip, sum(size) from table group by chan, fcip order by chan, fcip;
I added some timing counters in the code, changed the batch size of CarbonVectorProxy from 4 * 1024 to 32 * 1024, and used non-prefetch mode. The time stats (from one test run):
1. BlockletFullScanner.readBlocklet: 169ms;
2. BlockletFullScanner.scanBlocklet: 176ms;
3. DictionaryBasedVectorResultCollector.collectResultInColumnarBatch: 7958ms. In this part it takes about 200-300ms to handle each blocklet, so about 1s in total to handle one carbondata file; but the carbon stat log shows it takes about 1-2s per carbondata file for SQL1 and 2-3s per file for SQL2;
4. In CarbonScanRDD.internalCompute, the iterator executes 1464 times; each iteration takes about 8-9ms for SQL1 and 10-15ms for SQL2;
5. The total time of steps 1-3 is almost the same for SQL1 and SQL2.
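As a quick sanity check on the figures above (my own arithmetic, not from the carbon logs): 1464 iterations at 8-9 ms each already accounts for roughly 12-13 s of wall time across the 8 files, i.e. about 1.5-1.6 s per file, which is closer to what the stat log reports for SQL1 than the ~1 s measured inside collectResultInColumnarBatch alone:

```java
public class IterationTimeEstimate {
    public static void main(String[] args) {
        int iterations = 1464;      // observed in CarbonScanRDD.internalCompute
        int files = 8;              // carbondata files in the table
        double loMs = 8, hiMs = 9;  // per-iteration time observed for SQL1

        double totalLoSec = iterations * loMs / 1000.0;  // ~11.7 s
        double totalHiSec = iterations * hiMs / 1000.0;  // ~13.2 s
        System.out.printf("total: %.1f-%.1f s, per file: %.2f-%.2f s%n",
                totalLoSec, totalHiSec, totalLoSec / files, totalHiSec / files);
    }
}
```

This suggests part of the per-file gap may simply be downstream per-batch processing counted inside each iteration, rather than shuffle.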
My questions:
1. Is any optimization possible for DictionaryBasedVectorResultCollector.collectResultInColumnarBatch?
2. According to my time stats it takes about 1s to handle one carbondata file, but in the Spark UI it actually takes about 1-2s for SQL1 and 2-3s for SQL2. Why? Shuffle? Compute?
3. Can the size of CarbonVectorProxy be made configurable to reduce the number of iterations? The default value is 4 * 1024, and the iterator executes 11616 times.
If this property were configurable, how would you want to use it?
Does changing this property benefit all your queries? If it doesn't, a system-wide property may not suit every query. How about a hint for this property instead?
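For illustration, if the batch size were exposed as a session-level property rather than a hint, a per-session override via Spark SQL's SET command could look something like the sketch below. The property name carbon.scan.vector.batch.size is hypothetical; it is only meant to show the session-property option:

```sql
-- hypothetical property name, for illustration only
SET carbon.scan.vector.batch.size=32768;
select count(chan), count(fcip), sum(size) from table;
```

A hint would instead scope the override to a single query, which avoids surprising other queries in the same session.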
> On Sep 20, 2018, at 00:02, xm_zzc <[hidden email]> wrote:
> 3. Can it support to configurate the size of CarbonVectorProxy to
> reduce times of iterate? Default value is 4 * 1024 and iterate executes
> 11616 times.
I used SQL1 and SQL2 as test cases, running in local mode.
When the rowNum of CarbonVectorProxy (actually the capacity of the
ColumnarBatch) is 4 * 1024 (the default):
SQL1: 8s, 9s (two runs); SQL2: 12s, 11s
but when it is 16 * 1024:
SQL1: 6s, 6s; SQL2: 9s, 8s
So changing this property benefits both of my test cases.
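The iteration counts quoted above also line up with the batch capacity if you assume a batch never spans two blocklets (my assumption; the layout numbers, 8 files x 4 blocklets and 47474456 rows, come from the original post):

```java
public class BatchCountEstimate {
    // Estimate the number of ColumnarBatch iterations, assuming a batch
    // never crosses a blocklet boundary (each blocklet rounds up separately)
    // and that all blocklets hold roughly the same number of rows.
    static long estimate(long totalRows, long blocklets, long capacity) {
        long rowsPerBlocklet = (totalRows + blocklets - 1) / blocklets;  // ceil
        long batchesPerBlocklet = (rowsPerBlocklet + capacity - 1) / capacity;
        return blocklets * batchesPerBlocklet;
    }

    public static void main(String[] args) {
        long rows = 47474456L;
        long blocklets = 8 * 4;  // 8 files x 4 blocklets each
        // 11616 -- matches the observed iteration count at the default 4 * 1024
        System.out.println(estimate(rows, blocklets, 4 * 1024));
        // 1472 -- close to the observed 1464 (real blocklet sizes vary a bit)
        System.out.println(estimate(rows, blocklets, 32 * 1024));
    }
}
```

So a larger capacity cuts the iteration count roughly linearly, which fits the speedups measured above.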