Currently, rebuilding a datamap in carbondata has some problems. I'll explain the problems and propose possible solutions here.
Note: See datamap-management.md in the repo for the concepts of 'deferred-rebuild' and 'rebuild'.
`REBUILD DATAMAP datamap_name ON TABLE table_name` is used to refresh a specific datamap.
1. This operation can even be fired on a non-deferred-rebuild datamap, where it is not needed.
2. `REBUILD` in the current implementation rebuilds the whole datamap and discards the old datamap storage; in most scenarios this is not needed. Besides, while generating the new datamap data, we do not clear the old data first, which causes the rebuild to fail.
It seems that, of all the datamap types in carbondata, only `MV` needs to be rebuilt explicitly.
Index datamaps (including lucene, bloomfilter) and preaggregate datamaps (including timeseries) organize the datamap data by segment, and each datamap segment maps to a segment in the main table. So we can manage the datamap data at a fine granularity:
11. For a deferred-rebuild datamap, if we fire the `REBUILD DATAMAP` command on it, carbondata will generate datamap data for the segments that do not have datamap data yet.
12. If all the segments already have datamap data, this command will return immediately.
13. If this datamap is non-deferred-rebuild, this command will return with an error message.
14. In case of concurrent rebuilding, we will block concurrent data rebuilding for one datamap. A lock will be used to achieve this.
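To make points 11 to 14 concrete, here is a minimal sketch of the proposed behaviour. It is not carbondata code: the class `RebuildSketch`, the method `rebuild`, and the in-memory set of built segments are all hypothetical names made up for illustration, and a real implementation would use carbondata's own segment status and locking. I also assume "block" means a second concurrent rebuild is rejected rather than queued.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the proposed REBUILD DATAMAP semantics (not carbondata API).
final class RebuildSketch {
    // One lock per datamap name, so rebuilds of different datamaps do not block each other.
    private static final Map<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    /**
     * Builds datamap data only for segments that do not have it yet.
     * Returns the list of segments that were built by this call
     * (empty if every segment already had datamap data, point 12).
     */
    static List<String> rebuild(String dataMapName, boolean deferredRebuild,
                                List<String> tableSegments, Set<String> builtSegments) {
        // Point 13: rebuilding a non-deferred-rebuild datamap is an error.
        if (!deferredRebuild) {
            throw new IllegalStateException(
                "REBUILD is only allowed on a deferred-rebuild datamap: " + dataMapName);
        }
        ReentrantLock lock = LOCKS.computeIfAbsent(dataMapName, k -> new ReentrantLock());
        // Point 14: reject a rebuild if another thread is already rebuilding this datamap.
        if (!lock.tryLock()) {
            throw new IllegalStateException("Datamap is already being rebuilt: " + dataMapName);
        }
        try {
            List<String> rebuilt = new ArrayList<>();
            for (String segment : tableSegments) {
                // Point 11: Set.add returns true only for segments without datamap data.
                if (builtSegments.add(segment)) {
                    // Placeholder: the real datamap data for this segment would be generated here.
                    rebuilt.add(segment);
                }
            }
            return rebuilt;
        } finally {
            lock.unlock();
        }
    }
}
```

With segments 0 and 1 already built and segment 2 newly loaded, a call builds only segment 2, and an immediate second call returns an empty list. Failing fast with `tryLock` is one choice; waiting on `lock()` would be the alternative if we prefer queued rebuilds.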
For MV datamap, it seems that it is deferred-rebuild by default, and the structure of its datamap data is different from that of other datamaps. We will leave it as it is, which means the user will explicitly rebuild the datamap. We only have to ensure:
21. Since MV datamap is deferred-rebuild by default, either `WITH DEFERRED REBUILD` is not needed for MV datamap, or we should explicitly specify `WITH DEFERRED REBUILD` while creating an MV datamap. I'd prefer the former.
22. Block concurrent rebuilding for one MV datamap.
The last one:
31. Since deferred-rebuild is also a datamap property, how about letting the user specify it explicitly in DMPROPERTIES?
REBUILD DATAMAP is currently implemented only for full refresh, not for
incremental data loading. That is why it tries to refresh all the segments
irrespective of whether they are already built or not. We are planning
incremental rebuilding of datamaps for the next version.
I feel that, for index datamaps, we can block both creating a datamap with
deferred rebuild and rebuilding the datamap in the current code, since we
generate the index online while loading data. We can unblock the rebuild
feature for index datamaps after we finish the incremental rebuilding framework.
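
The blocking suggested above could look roughly like the check below. This is only a sketch: `IndexDataMapGuard`, `validateCreate` and `validateRebuild` are hypothetical names, and I assume the index datamap providers are the two mentioned earlier in this thread, lucene and bloomfilter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical guard that blocks deferred rebuild for index datamaps (not carbondata API).
final class IndexDataMapGuard {
    // Assumption: these are the index datamap providers named in this thread.
    private static final Set<String> INDEX_PROVIDERS =
        new HashSet<>(Arrays.asList("lucene", "bloomfilter"));

    static boolean isIndexDataMap(String provider) {
        return INDEX_PROVIDERS.contains(provider.toLowerCase(Locale.ROOT));
    }

    // Called at CREATE DATAMAP time: index datamaps are built online during
    // data loading, so WITH DEFERRED REBUILD is rejected for them.
    static void validateCreate(String provider, boolean deferredRebuild) {
        if (deferredRebuild && isIndexDataMap(provider)) {
            throw new UnsupportedOperationException(
                "WITH DEFERRED REBUILD is not supported for index datamap: " + provider);
        }
    }

    // Called at REBUILD DATAMAP time: rebuild stays blocked for index datamaps
    // until incremental rebuilding is available.
    static void validateRebuild(String provider) {
        if (isIndexDataMap(provider)) {
            throw new UnsupportedOperationException(
                "REBUILD DATAMAP is not supported for index datamap: " + provider);
        }
    }
}
```

Keeping both checks in one place should make it easy to lift the restriction later by changing only this guard.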