[Discussion] How to configure the unsafe working memory for data loading


xuchuanyin
Hi all,
I went through the code and derived another formula to estimate the unsafe
working memory. It is still inaccurate, but we can use this thread to refine it.

# Memory Required For Data Loading per Table

## version from Community
(carbon.number.of.cores.while.loading) * (offheap.sort.chunk.size.inmb +
carbon.blockletgroup.size.in.mb + carbon.blockletgroup.size.in.mb/3.5)
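To make the arithmetic concrete, here is a minimal Java sketch of this
formula; the class and method names are mine for illustration, not part of
the CarbonData code base:

```java
// Sketch of the community formula. The parameters mirror the CarbonData
// properties named above; the class itself is illustrative only.
public class CommunityLoadMemoryEstimate {

  /**
   * @param numberOfCores       carbon.number.of.cores.while.loading
   * @param sortChunkSizeMb     offheap.sort.chunk.size.inmb
   * @param blockletGroupSizeMb carbon.blockletgroup.size.in.mb
   * @return estimated unsafe working memory in MB
   */
  static double estimateMb(int numberOfCores, double sortChunkSizeMb,
      double blockletGroupSizeMb) {
    // the blockletgroup.size/3.5 term approximates the snappy scratch buffer
    return numberOfCores
        * (sortChunkSizeMb + blockletGroupSizeMb + blockletGroupSizeMb / 3.5);
  }

  public static void main(String[] args) {
    // values from the example later in this mail
    System.out.printf("community estimate: %.0f MB%n", estimateMb(15, 64, 64));
  }
}
```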

## version from proposal
memory_size_required
 = max(sort_temp_memory_consumption, data_encoding_consumption)
 = max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb,
       number.of.cores * TABLE_PAGE_SIZE}
 = max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb,
       number.of.cores * (number.of.fields * per.column.page.size + compress.temp.size)}
 = max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb,
       number.of.cores * (number.of.fields * per.column.page.size + per.column.page.size/3.5)}
 = max{(number.of.cores + 1) * offheap.sort.chunk.size.inmb,
       number.of.cores * (number.of.fields * (32000 * 8 * 1.25) + (32000 * 8 * 1.25)/3.5)}

Note:
1.  offheap.sort.chunk.size.inmb is the size of an UnsafeCarbonRowPage
2.  per.column.page.size is the size of a ColumnPage
3.  compress.temp.size is the temporary buffer size for snappy compression
(in UnsafeFixLengthColumnPage.compress)

## problems of each version
1.  Neither considers the local dictionary, which is disabled by default;
2.  Neither considers the in-memory intermediate merge, which is disabled by
default.

### for Community version
1. For each load, the sort-temp procedure finishes before the
producer-consumer procedure, so we do not need to accumulate the two.
2. During the producer-consumer procedure, #number.of.cores TablePages are
generated at a time, and their total size may exceed
#carbon.blockletgroup.size.in.mb, so budgeting only
#carbon.blockletgroup.size.in.mb can still cause a memory shortage,
especially when #number.of.cores is high.

### for proposed version
1. It roughly uses 8 bytes * 1.25 (a factor in our code) as the size of each
value, which is inaccurate. Besides, 32000 is the maximum record number per
page, especially after adaptive page size for long strings and complex types
is implemented.
2. We can further decompose #per.column.page.size by identifying the datatype
and, for string columns, the data length, though this may be too tedious for
users to calculate by hand (see the sketch below). We can also run the data
loading once and measure #TABLE_PAGE_SIZE or #per.column.page.size, which
should be accurate.
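For instance, a per-datatype decomposition could look like the following
hypothetical sketch; the per-value widths and the 20-byte average string
length are assumptions for illustration, not measured values, and real pages
may use adaptive encodings with smaller footprints:

```java
// Hypothetical decomposition of per.column.page.size by datatype.
public class PerColumnPageSizeSketch {

  static final int ROWS_PER_PAGE = 32000;   // max records per page
  static final double GROWTH_FACTOR = 1.25; // the factor from our code

  static long pageBytes(int bytesPerValue) {
    return (long) (ROWS_PER_PAGE * bytesPerValue * GROWTH_FACTOR);
  }

  public static void main(String[] args) {
    // assumed widths: 4-byte int, 8-byte long/double,
    // string as a 2-byte length prefix + assumed 20-byte average value
    System.out.println("int column:    " + pageBytes(4) + " bytes");
    System.out.println("long column:   " + pageBytes(8) + " bytes");
    System.out.println("string column: " + pageBytes(2 + 20) + " bytes");
  }
}
```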

## for example
number.of.cores = 15
offheap.sort.chunk.size.inmb = 64
number.of.fields = 300

### Community version
memory_size_required
 = 15 * (64MB + 64MB + 64MB/3.5)
 ≈ 2194MB

### proposed version
memory_size_required
 = max{(15 + 1) * 64MB, 15 * (300 * (32000 * 8 * 1.25) + 32000 * 8 * 1.25 /
3.5)}
 = max{1073741824, 15 * 96091428}
 = max{1073741824, 1441371420}
 ≈ 1375MB




