Support SI at Segment level

Nihal
Hi all,

Currently, if the parent (main) table and the SI (secondary index) table do
not have the same valid segments, we disable the SI table. From the next query
onwards, we scan and prune only the parent table until the next load or
REINDEX command is triggered (these commands bring the parent and SI table
segments back in sync). As a result, queries take longer to return results
while SI is disabled.

To solve this problem, we are planning to support SI at the segment level.
That is, we will not disable SI when the parent and SI tables do not have the
same segments. Instead, we will prune on the SI table for all segments that
are valid there, and prune the rest of the segments on the main/parent table.


At the time of pruning with the main table in TableIndex.prune, if an SI
exists for the corresponding filter, then all segments that are not present in
the SI table will be pruned on the corresponding parent table segment.
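
The per-segment split described above can be sketched as follows. This is a
minimal illustration, not CarbonData code: the class SegmentSplit, its field
and method names, and the use of plain string segment IDs are all assumptions
made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of CarbonData): decides, segment by segment,
// whether pruning can use the SI table or must fall back to the main table.
final class SegmentSplit {
    final List<String> pruneWithSi = new ArrayList<>();
    final List<String> pruneWithMainTable = new ArrayList<>();

    // mainValidSegments: valid segment IDs of the parent (main) table.
    // siValidSegments:   valid segment IDs of the SI table.
    static SegmentSplit of(List<String> mainValidSegments, Set<String> siValidSegments) {
        SegmentSplit split = new SegmentSplit();
        for (String segmentId : mainValidSegments) {
            if (siValidSegments.contains(segmentId)) {
                split.pruneWithSi.add(segmentId);        // SI is in sync for this segment
            } else {
                split.pruneWithMainTable.add(segmentId); // SI is missing this segment
            }
        }
        return split;
    }

    public static void main(String[] args) {
        // Segment 3 was loaded into the main table but not yet into the SI table.
        SegmentSplit split = SegmentSplit.of(
            Arrays.asList("0", "1", "2", "3"),
            new HashSet<>(Arrays.asList("0", "1", "2")));
        System.out.println("SI: " + split.pruneWithSi);          // SI: [0, 1, 2]
        System.out.println("main: " + split.pruneWithMainTable); // main: [3]
    }
}
```

With the current table-level behaviour, the same input would disable SI for
all four segments; with the split, only segment 3 falls back to the main table.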

Please let me know your thoughts and inputs on this.

Regards
Nihal kumar ojha



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Support SI at Segment level

David CaiQiang
Hi Nihal,
My thoughts are as follows.
1. Differences between segment level and table level
  a) Push down SI into CarbonDataSourceScan/Relation and avoid rewriting the
SQL plan.
  b) Different segments will have different SIs, so different segments may
choose different SIs.

2. Data loading/compaction/update/delete/merge
  a) The main table can update its tablestatus metadata entry to success
status before SI loading.
  b) If SI is disabled, there is no need to do SI loading; if SI is enabled,
SI loading can proceed.

3. Query
  a) Reading the data of the SI table could happen on the executor side;
reading the index of the SI table could happen on the driver side.
  b) Performance: the system currently uses a distributed job (a groupBy and
join query) to collect the positionIDs of the result rows; if TableIndex.prune
uses a single thread, it will have a performance issue.
  c) When the table has multiple SI tables, the table-level positionId join
should be converted to a segment-level join.
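
Point 3 c) can be illustrated with a rough sketch. It assumes that the filter
combines two indexed columns with AND, so the join reduces to a per-segment
intersection of positionIds. The class SegmentLevelJoin, its method, and the
map-based representation are hypothetical and only meant for illustration;
segments that are missing from either SI table are left out of the result and
would have to be pruned on the main table.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch (not CarbonData code): combines the hits of two SI
// tables segment by segment instead of joining the positionIds of the
// whole table in one step.
final class SegmentLevelJoin {
    // Each argument maps a segment ID to the positionIds that one SI table
    // returned for that segment. The result holds, for every segment covered
    // by both SI tables, the positionIds present in both.
    static Map<String, Set<String>> join(Map<String, Set<String>> hitsFromSi1,
                                         Map<String, Set<String>> hitsFromSi2) {
        Map<String, Set<String>> joined = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : hitsFromSi1.entrySet()) {
            Set<String> other = hitsFromSi2.get(entry.getKey());
            if (other == null) {
                continue; // segment not covered by the second SI table
            }
            Set<String> common = new HashSet<>(entry.getValue());
            common.retainAll(other); // keep only positionIds found in both
            joined.put(entry.getKey(), common);
        }
        return joined;
    }
}
```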



-----
Best Regards
David Cai

Re: Support SI at Segment level

akashrn5
In reply to this post by Nihal
Hi Nihal,

Thanks for bringing this up. It is an important feature that lets us leverage
SI at the smaller, segment level as well.

Work is already under way to make SI prune through the data map interface, so
your design should be aligned with that.
It would be better to check the SI-as-data-map design first and then prepare
the design for this feature. That will give a clear picture for review and for
starting the work; otherwise the two designs may contradict each other.

Thanks,

Regards,
Akash R Nilugal




Re: Support SI at Segment level

Nihal
In reply to this post by Nihal
Hi,
   Thanks for the input.

   Work is already under way to support SI pruning through the data map
interface (without SQL plan rewrite). That will be controlled by a carbon
property, and we are not going to remove the current design (SI support with
SQL plan rewrite).

   So we are first focusing on leveraging SI at the segment level with SQL
plan rewrite. Please go through this design document
<https://docs.google.com/document/d/1q1UIrMO4KGZuBICrixrv4JsbrblATSQVuYY0IAKxWn0/edit>
and give your inputs or suggestions.

Regards
Nihal kumar ojha




Re: Support SI at Segment level

akashrn5
In reply to this post by akashrn5
Hi,

+1 for the feature. This is very important for improving query performance,
instead of waiting for the SI and main table to always be in sync.

I have reviewed the doc and given comments; please address them. Please also
discuss the SI-as-datamap feature with @venu so that the two designs stay in
line, as mentioned earlier.

P.S.: This design should later be handled for the SI-as-datamap flow as well;
for now it is being handled only for the existing flow.

Thanks,

Regards,
Akash R




Re: Support SI at Segment level

Mahesh Raju Somalaraju
In reply to this post by Nihal
Hi,

+1 for the feature.
It will make queries faster.

1) As per the design discussion, the feature (SI to prune as a data frame)
has one property to set.
  If the data engine wants to use SI as datamap, the property needs to be
set; if it is not set, the plan re-write flow is used.

  So we have to handle this feature in both cases. Could you please check and
update the design accordingly?
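
The two cases can be pictured with a small sketch. The property name below is
made up for the example (this thread does not name the actual property), and
the class SiFlowSelector and its enum are hypothetical, not CarbonData APIs.

```java
import java.util.Properties;

// Hypothetical sketch (not CarbonData code): picks the query flow based on
// a single boolean property, defaulting to the plan re-write flow.
final class SiFlowSelector {
    enum Flow { SI_AS_DATAMAP, PLAN_REWRITE }

    // Assumed property name, for illustration only.
    static final String USE_SI_AS_DATAMAP = "example.si.as.datamap";

    static Flow select(Properties props) {
        boolean useDatamap =
            Boolean.parseBoolean(props.getProperty(USE_SI_AS_DATAMAP, "false"));
        return useDatamap ? Flow.SI_AS_DATAMAP : Flow.PLAN_REWRITE;
    }
}
```

Segment-level SI support would then need to work under both values returned by
select.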

References:
SI to prune as a data frame
https://docs.google.com/document/d/1VZlRYqydjzBXmZcFLQ4Ty-lK8RQlYVDoEfIId7vOaxk/edit?usp=sharing

Thanks & Regards
Mahesh Raju Somalaraju

On Wed, Feb 17, 2021 at 4:05 PM Nihal <[hidden email]> wrote:

> [...]

Re: Support SI at Segment level

Ajantha Bhat
+1 for this proposal.

But the other ongoing requirement (
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Presto-Queries-leveraging-Secondary-Index-td105291.html)
depends on *isSITableEnabled*,
so it is better to wait for it to finish and then redesign on top of it.

Thanks,
Ajantha

On Tue, Mar 23, 2021 at 1:03 PM Mahesh Raju Somalaraju <
[hidden email]> wrote:

> [...]

Re: Support SI at Segment level

vikramahuja1001
+1 on this.
Agree with Ajantha on this.

Vikram Ahuja

Re: Support SI at Segment level

Nihal
Hi All,
    Thanks for your inputs and suggestions.

    For now, we will support leveraging SI at the segment level only with SQL
plan rewrite (as already mentioned in this thread and in the design document).

    Parallel work is going on to support SI as datamap (without plan
rewrite), which will be at the table level.
This work is independent of the existing property "isSITableEnabled",
as mentioned in the design doc and PR 4110
<https://github.com/apache/carbondata/pull/4110>.
Also, there is no other major conflict or dependency between the two designs,
so we can safely handle both pieces of work in parallel.

    We are planning to extend the datamap SI to the segment level later
(once the PR is merged). I will create a separate JIRA ticket to track this
work.


Regards
Nihal kumar ojha


