Currently, data management scenarios (data loading, segment compaction, etc.) involve several data deletion actions. These actions are dangerous because they are implemented in different places, and some corner cases can cause data to be deleted accidentally.
3. Delete temporary files
In the default setting, the loading process writes to a temporary file first and copies it to the target path at the end of loading. In the final step, this method deletes the temporary data files which were moved to the target path.
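The write-to-temp-then-publish pattern described above can be sketched as follows. This is a minimal illustration using plain `java.nio.file`; the directory names and `load` helper are assumptions for the sketch, not CarbonData's actual layout:

```java
import java.io.IOException;
import java.nio.file.*;

public class TempFileLoading {
    // Write segment data to a temporary location first, then publish it to the
    // target path with a move; nothing is deleted until the data is safely placed.
    public static Path load(Path tempDir, Path targetDir, String name, byte[] data)
            throws IOException {
        Files.createDirectories(tempDir);
        Files.createDirectories(targetDir);
        Path temp = tempDir.resolve(name + ".tmp");
        Files.write(temp, data);                      // 1. write to the temp file
        Path target = targetDir.resolve(name);
        Files.move(temp, target,                      // 2. publish to the target path;
                StandardCopyOption.REPLACE_EXISTING); //    the temp file disappears here
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("load-demo");
        Path out = load(base.resolve("tmp"), base.resolve("store"), "part-0", new byte[]{1, 2});
        System.out.println(Files.exists(out));  // true
    }
}
```

A crash before the move leaves only an orphaned `.tmp` file behind, which is exactly the kind of leftover the cleanup steps below have to remove without touching real data.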
Data Deletion Hotfix in Loading Process
By analysing the deletion actions during the loading process, we are going to make some modifications to the deletion logic in the loading flow to keep data from being deleted by accident.
There are some steps to fix the problem:
(1) Replace the stale-data cleaning function with CleanFiles actions.
(2) Ignore the segments whose status is INSERT_IN_PROGRESS or
INSERT_OVERWRITE_IN_PROGRESS, because the loading process might take a long
time in a highly concurrent situation. These two kinds of segments are left
to be deleted by the CleanFiles command. Besides, there will be a recycle bin
to store the deleted files temporarily; users can find their deleted
segments in the recycle bin.
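The two steps above can be sketched as follows. The `SegmentStatus` enum, the `recycle` helper, and the trash-directory layout are simplified assumptions for illustration, not CarbonData's actual implementation:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class CleanFilesSketch {
    enum SegmentStatus {
        SUCCESS, MARKED_FOR_DELETE, COMPACTED,
        INSERT_IN_PROGRESS, INSERT_OVERWRITE_IN_PROGRESS
    }

    // Segments a concurrent load may still own; never touch these automatically.
    static boolean inProgress(SegmentStatus s) {
        return s == SegmentStatus.INSERT_IN_PROGRESS
            || s == SegmentStatus.INSERT_OVERWRITE_IN_PROGRESS;
    }

    // Instead of deleting, move a stale segment file into a recycle bin so
    // users can still recover it; in-progress segments are skipped entirely.
    static Optional<Path> recycle(Path segmentFile, SegmentStatus status, Path recycleBin)
            throws IOException {
        if (inProgress(status)) {
            return Optional.empty();  // left alone until an explicit CleanFiles run
        }
        Files.createDirectories(recycleBin);
        Path target = recycleBin.resolve(segmentFile.getFileName());
        Files.move(segmentFile, target, StandardCopyOption.REPLACE_EXISTING);
        return Optional.of(target);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("recycle-demo");
        Path seg = Files.write(base.resolve("segment_0"), new byte[]{1});
        Optional<Path> moved =
                recycle(seg, SegmentStatus.MARKED_FOR_DELETE, base.resolve(".trash"));
        System.out.println(moved.isPresent());  // true: recoverable from the recycle bin
    }
}
```

Moving to a recycle bin instead of calling delete keeps the cleanup reversible, which is the whole point of the hotfix: a wrong status decision can be undone by the user.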
I agree to take a hotfix for data deletion in the loading and compaction flows.
Deleting the INSERT_IN_PROGRESS and INSERT_OVERWRITE_IN_PROGRESS segments is a
dangerous activity, so these two kinds of segments should not be deleted
automatically.
As for segments with MARKED_FOR_DELETE and COMPACTED status, these are stale
segments, but we can keep them in the file system until the user/admin runs
the clean files action manually, since the deletion requires the table status
to be precise.
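The retention rule in this reply can be expressed as a single guard. This is a minimal sketch; the status names follow the discussion, and `mayDelete` is a hypothetical helper, not a real CarbonData API:

```java
public class CleanPolicy {
    enum SegmentStatus {
        SUCCESS, MARKED_FOR_DELETE, COMPACTED,
        INSERT_IN_PROGRESS, INSERT_OVERWRITE_IN_PROGRESS
    }

    // A segment may be physically removed only when the user/admin runs the
    // clean files action explicitly AND the segment is genuinely stale.
    static boolean mayDelete(SegmentStatus status, boolean explicitCleanFiles) {
        if (!explicitCleanFiles) {
            return false;  // loading/compaction flows never delete automatically
        }
        return status == SegmentStatus.MARKED_FOR_DELETE
            || status == SegmentStatus.COMPACTED;
    }

    public static void main(String[] args) {
        System.out.println(mayDelete(SegmentStatus.MARKED_FOR_DELETE, true));   // true
        System.out.println(mayDelete(SegmentStatus.INSERT_IN_PROGRESS, true));  // false
        System.out.println(mayDelete(SegmentStatus.MARKED_FOR_DELETE, false));  // false
    }
}
```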
So my opinion is to remove all the automatic clean steps in the
loading/compaction flows first, to protect the data from being deleted
accidentally.