Currently, when data is loaded with sort_scope set to NO_SORT and those
segments are later compacted, the data is still not sorted, and this will hit
query performance.
The above problem can be solved by sorting the data during compaction, which
helps query performance.
During busy hours, if a customer loads data and we sort by default, loading
will be slow. If the user instead sets the sort scope to NO_SORT and loads
data, data loading will be faster. Then, when compaction is triggered, all the
data will be sorted and written to the compacted segment. This will help
queries, but compaction performance will degrade and this should be
We can expose a property. By default the current flow is taken; if the
property is configured, the data will be sorted and the compacted segment
written. Compaction performance will take a hit; I will collect data on the
degradation and publish it. Please give your inputs on this.
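The property-gated behaviour described above could be sketched roughly as follows. This is only an illustration in Python, not CarbonData code: the property name `carbon.compaction.sort.enabled` and the function `compact` are hypothetical and do not refer to existing settings or APIs.

```python
# Hypothetical property name, used only for illustration.
SORT_DURING_COMPACTION = "carbon.compaction.sort.enabled"


def compact(segments, sort_columns, properties):
    """Merge segments into one; sort the rows only if the property is set.

    segments: list of segments, each a list of row dicts.
    sort_columns: column names to sort on (stands in for SORT_COLUMNS).
    properties: dict of configuration values (strings).
    """
    rows = [row for segment in segments for row in segment]
    enabled = properties.get(SORT_DURING_COMPACTION, "false").lower() == "true"
    if enabled:
        # Proposed flow: sort all rows on the sort columns before writing.
        rows.sort(key=lambda row: tuple(row[c] for c in sort_columns))
    # When the property is unset, rows keep their existing order (current flow).
    return rows
```

With the property left unset the rows keep their load order, so existing behaviour would not change unless a user opts in.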
It should be faster after compaction with sort. Please test and compare the
compaction performance between sort and no sort. Please also support dynamic
configuration in CarbonProperties for compaction with sort and no sort,
especially as their performance differs greatly.
Basically, if the user wants data to be loaded fast, he will use no sort.
During compaction, if we sort the data and load it into the new compacted
segment, then the complete data will be sorted, which helps query
performance. I hope I have answered your question.
What’s your proposal for the corresponding grammar to do that?
Besides, if we only sort after compaction, is it proper to keep sort_scope at table level? In this situation it should be at segment level; keeping it at table level will confuse the user. What do you think about this?
Currently, what I have in mind is that we sort during compaction only if all
the loads involved in the compaction are no sort. So keeping the current
table level is fine. If the table has no_sort, the data will be sorted during
compaction; if it has local sort, it will go through the current compaction
flow. I think there will be no confusion.
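The rule proposed above (sort during compaction only when every load involved is no sort) can be written down as a small decision function. This is a sketch in Python for discussion; the function name and the returned labels are hypothetical, not CarbonData identifiers.

```python
def compaction_flow(segment_sort_scopes):
    """Pick the compaction flow from the sort scopes of the loads involved.

    segment_sort_scopes: iterable of sort scope names, one per load,
    for example ["NO_SORT", "NO_SORT"].
    """
    scopes = {scope.upper() for scope in segment_sort_scopes}
    if scopes == {"NO_SORT"}:
        # Every load was unsorted, so sorting happens during compaction.
        return "sort_during_compaction"
    # Anything else (local sort, or a mix of scopes) keeps the current flow.
    return "current_compaction_flow"
```

For example, `compaction_flow(["NO_SORT", "LOCAL_SORT"])` falls back to the current flow, since not all loads are no sort.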
The scope of this feature is to SORT the data during compaction when the
data was loaded using the NO_SORT option.
There are a few users who want to maximize data load speed and then fine
tune the data further during off-peak time (when the system is least used)
by executing the compaction operation.
Sorting will be done during compaction based on the SORT_COLUMNS property
provided when the table was created.
Please find my response below to your queries.
1. Will it be proper to keep the sort_scope at table level? It should be at
segment level in this situation, and keeping it at table level will confuse the
user.
Yes. This is expected, as the feature is specifically meant to support sorting
of data during compaction, so data load operations are expected to be done with
SORT_SCOPE as NO_SORT. But we cannot control that, so if multiple data load
operations are done with different sort_scope values, then during compaction
we have to sort only the segments that are not sorted; the remaining segments
should go only through the merge sort flow.
After the compaction operation, all the data will be written using local sort.
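The mixed case described above (sort only the unsorted segments, then merge everything) might look like the following sketch. It is plain Python for illustration, not CarbonData code: `compact_mixed` is a hypothetical name, and it assumes that every segment not loaded with NO_SORT is already a single run sorted on the same sort columns.

```python
import heapq


def compact_mixed(segments, sort_columns):
    """Compact segments loaded with different sort scopes.

    segments: list of (sort_scope, rows) pairs; rows is a list of dicts.
    sort_columns: column names to order by (stands in for SORT_COLUMNS).
    """
    def key(row):
        return tuple(row[c] for c in sort_columns)

    sorted_runs = []
    for sort_scope, rows in segments:
        if sort_scope == "NO_SORT":
            # Only unsorted segments pay the cost of a full sort.
            rows = sorted(rows, key=key)
        sorted_runs.append(rows)
    # Every run is now sorted, so a k-way merge produces sorted output.
    return list(heapq.merge(*sorted_runs, key=key))
```

The cost of the extra sort then grows only with the amount of NO_SORT data, while segments that were already sorted go through the merge step alone.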
So what’s your proposal for the grammar of this feature?
Do you want carbon to do it silently, without any configuration or choice from the user?
What I am concerned about is the performance of compaction. If the user uses auto-compaction, loading will be delayed further if we do compaction using local sort.
Moreover, if the user can bear the compaction time, will he want it to be global sort or something else?
The two points above are why I want to know about the grammar for this feature.