Reputation: 1435
I have a large dataset with more than 1 billion observations, and I need to perform some string operations on it, which are slow.
My code is as simple as this:
DT[, var := some_function(var2)]
If I'm not mistaken, data.table uses multithreading when it is called with by, and I'm trying to parallelize this operation by exploiting that. To do so, I can make an interim grouper variable, such as
DT[, grouper := .I %/% 100]
and do
DT[, var := some_function(var2), by = grouper]
I tried some benchmarking with a small sample of data, but surprisingly I did not see a performance improvement. So my questions are:
Does data.table use multithreading when it's used with by?
Is there a way to get data.table to use multithreading here?
FYI, I see that multithreading is enabled with half of my cores when I import data.table, so I guess there's no OpenMP issue here.
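For reference, data.table's thread count can be inspected and changed directly; a small sketch:

```r
# Inspecting and adjusting data.table's thread usage.
library(data.table)

getDTthreads()              # threads data.table will use (default: half the logical cores)
setDTthreads(0)             # 0 = use all logical cores
setDTthreads(percent = 50)  # or request a percentage of the cores
getDTthreads()
```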
Upvotes: 9
Views: 1945
I got answers from the data.table developers on the data.table GitHub. Here's a summary:
Finding the groups of the by variable is itself always parallelized, but more importantly,
If the function in j is generic (a user-defined function), there is no parallelization.
Operations in j are parallelized if the function is GForce-optimized, i.e. the expression in j contains only the functions min, max, mean, median, var, sd, sum, prod, first, last, head, tail.
So, it is advised to parallelize the operation manually if the function in j is generic, but that may not always guarantee a speed gain. Reference
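One way to check whether a given j expression actually gets GForce-optimized is to run the query with verbose = TRUE and look for a GForce line in the output; a small sketch:

```r
# Running with verbose = TRUE shows whether j was GForce-optimized.
library(data.table)

DT <- data.table(g = rep(1:2, each = 5), x = rnorm(10))

# mean() alone in j: the verbose output should mention GForce.
DT[, mean(x), by = g, verbose = TRUE]

# A generic (user-defined) function in j: no GForce optimization.
f <- function(x) mean(x)
DT[, f(x), by = g, verbose = TRUE]
```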
==Solution==
In my case, I ran into vector memory exhaustion when I plainly used DT[, var := some_function(var2)], even though my server had 1 TB of RAM while the data took up 200 GB of memory.
I used split(DT, by='grouper') to split my data.table into chunks, and used doFuture with foreach and %dopar% to do the job. It was pretty fast.
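The split/%dopar% approach described above can be sketched as follows; slow_string_fn is again a made-up stand-in for the real some_function(), and the chunk size is arbitrary:

```r
# A sketch of the split / %dopar% approach; slow_string_fn is a
# made-up stand-in for the real some_function().
library(data.table)
library(foreach)
library(doFuture)

registerDoFuture()   # use futures as the foreach backend
plan(multisession)   # parallel R sessions on the local machine

DT <- data.table(var2 = sample(letters, 1e4, replace = TRUE))
slow_string_fn <- function(x) toupper(x)

# Chunk the table, process chunks in parallel, then reassemble.
DT[, grouper := .I %/% 1000]
chunks <- split(DT, by = "grouper")
result <- rbindlist(
  foreach(chunk = chunks) %dopar% {
    chunk[, var := slow_string_fn(var2)]
    chunk
  }
)
```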
Upvotes: 5