[DISCUSSION] Geo spatial index algorithm improvement and UDFs enhancement
CarbonData currently supports a geo spatial index and one query UDF,
'InPolygon'.
We plan to optimize the spatial index feature in three ways:
1 reduce the number of table properties required when creating a geo table;
2 add more UDFs to support more complex query scenarios;
3 allow users to specify the spatial index value during 'LOAD' and
'INSERT INTO'; carbon will still generate the value of the spatial index
column internally when the user does not provide one.
I have added an initial v1 design document 'CarbonData Spatial Index Design
Doc.docx' and a UDF interface design document 'Carbon Geo UDF Enhancement
Interface Design.docx'. Please review them and share any comments or
suggestions.
Re: [DISCUSSION] Geo spatial index algorithm improvement and UDFs enhancement
Hi Shen Jiayu,
It is an interesting feature, thanks for proposing this.
+1 from my side for the high-level design.
I have a few suggestions and questions.
a) It would be better to separate the new UDF and utility UDF PRs from the
algorithm improvement PR for ease of review and maintainability.
b) Union, intersection, and diff of polygons can be computed during
filter expression creation, and the final polygon coordinates can then be
sent to carbon as one range filter.
c) Regarding the algorithm improvement, I see that you have removed a few
parameters such as ‘minLongitude’, ‘maxLongitude’, ‘minLatitude’, and
‘maxLatitude’. Has anything else changed? Could you describe in more detail
what changes were made to improve the algorithm?
d) Please capture performance results with and without the algorithm
changes.
e) You have also mentioned supporting a user-supplied Geohash column during
load. In this case there is no need to configure any spatial index
properties in the table properties, right?
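
As a rough illustration of suggestion b), the sketch below combines
polygons before the filter reaches the scan. It assumes each polygon has
already been converted into a list of inclusive ranges of spatial index
values; that conversion and the helper names `normalize`, `union`,
`intersection`, and `difference` are hypothetical and are not taken from
the CarbonData code or the design documents.

```python
from typing import List, Tuple

# Inclusive [start, end] range of spatial index values (assumed integers).
Range = Tuple[int, int]


def normalize(ranges: List[Range]) -> List[Range]:
    """Sort ranges and merge the ones that overlap or touch."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union(a: List[Range], b: List[Range]) -> List[Range]:
    """Ranges covered by either input."""
    return normalize(list(a) + list(b))


def intersection(a: List[Range], b: List[Range]) -> List[Range]:
    """Ranges covered by both inputs, using a two-pointer sweep."""
    a, b = normalize(a), normalize(b)
    out: List[Range] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def difference(a: List[Range], b: List[Range]) -> List[Range]:
    """Ranges covered by a but not by b."""
    a, b = normalize(a), normalize(b)
    out: List[Range] = []
    j = 0
    for start, end in a:
        cur = start
        # Skip ranges of b that end before the current position.
        while j < len(b) and b[j][1] < cur:
            j += 1
        k = j
        while k < len(b) and b[k][0] <= end:
            if b[k][0] > cur:
                out.append((cur, b[k][0] - 1))
            cur = max(cur, b[k][1] + 1)
            k += 1
        if cur <= end:
            out.append((cur, end))
    return out
```

With this shape, the scan would only ever see a single, already combined
range list, regardless of how many polygons the query expression mentions.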
Just a few points/queries:
1. Util UDFs seem to take origin latitude and grid size as arguments as well.
Shall we inherit them from the table specified in the query during query
processing? That would probably avoid invalid/inconsistent origin latitude
and grid size values being given as UDF arguments (i.e., not the same values
as in
2. Regarding the point - *"Allowing flexibility to user to specify the
spatial index value when 'LOAD' and 'INSERT INTO' without generating it
implicitly based on the configured table properties(i.e., grid size, origin
Wouldn't the query results vary when user-supplied spatial index values
differ from the generated ones? Also, for insert into target_table
select * from source_table, source_table and target_table may not
necessarily have the same geo-specific table properties. The spatial index
value may need to be generated based on the target_table properties.
3. After these algorithm improvements, do polygon query results match those
of the current version? If not, I suggest capturing the differences in the
doc.
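
To make the concern in point 2 concrete, here is a toy grid index. It is
not CarbonData's actual algorithm: the `GeoProps` fields (including an
origin longitude and a cell size in degrees), `interleave_bits`, and
`toy_spatial_index` are all made up for illustration. It only shows that
the same coordinate can map to different index values under different
table properties, which is why copying a source table's value into a
target table unchanged could be inconsistent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoProps:
    """Hypothetical geo table properties for a toy grid index."""
    origin_longitude: float
    origin_latitude: float
    cell_degrees: float  # assumed cell size in degrees


def interleave_bits(row: int, col: int, bits: int = 16) -> int:
    """Interleave the low bits of row and col (Z-order style)."""
    value = 0
    for i in range(bits):
        value |= ((col >> i) & 1) << (2 * i)
        value |= ((row >> i) & 1) << (2 * i + 1)
    return value


def toy_spatial_index(longitude: float, latitude: float, props: GeoProps) -> int:
    """Map a coordinate to a grid cell, then to a single integer."""
    col = int((longitude - props.origin_longitude) // props.cell_degrees)
    row = int((latitude - props.origin_latitude) // props.cell_degrees)
    return interleave_bits(row, col)


# Same point, two tables with different (hypothetical) properties.
source_props = GeoProps(0.0, 0.0, 0.5)
target_props = GeoProps(0.0, 0.0, 0.25)
point = (116.3, 40.1)

source_value = toy_spatial_index(*point, source_props)
target_value = toy_spatial_index(*point, target_props)

# The value carried over from the source table would not match what the
# target table's own properties produce for the same coordinate.
print(source_value != target_value)
```

Under these assumptions, regenerating the value from longitude and
latitude with the target table's properties is the only way the stored
index agrees with what a query on the target table would compute.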