Vijay

Reputation: 151

Selection/Projection/Grouping in large datasets

I read the article http://hannes.muehleisen.org/ssdbm2014-r-embedded-monetdb-cr.pdf and was glad to see that data.table performed very well. However, I was surprised that selection, selection/projection, and grouping were so slow for the larger datasets (1GB and 10GB). I think data.table is amazing, and I find it surprising that it is 5x-10x slower on the larger datasets.

I understand that I shouldn't put much stock in micro-benchmarks, and I don't. In fact, after reading the article I'm more convinced that using data.table is beneficial because of its consistent syntax and simplicity; I don't care only about raw performance. I'm asking this question because the data.table authors are interested in examining these questions and are great at explaining why (or why not) data.table performs the way it does. This is another reason I love using data.table.

Thanks, Matt Dowle et al.

Upvotes: 2

Views: 87

Answers (1)

Arun

Reputation: 118799

Thanks for the link and the praise. Very interesting read. The three points that seem really impressive to me (out of the many cool things) are:

  • Embedding the database into the R/statistics environment (the reverse of the norm).

  • Bringing the two systems under the same address space.

  • Converting from primitive types to SEXPs without requiring a copy (/ extra memory).

although these require modifications to the source.

On the comparisons to data.table, however, here are some concerns:

They compare against v1.8.10, which is more than a year old. Since then, data.table has evolved QUITE A LOT:

  • Faster, cache-efficient, MSD-based radix sorting (for integers, doubles, characters and integer64). Since data.table uses ordering to find group indices for aggregations, joins and almost everything else, this means almost all operations are much, much faster now.

  • Implementation of GForce, which avoids the time spent evaluating j-expressions for each group and makes ad hoc grouping with those functions even faster (see the sketch after this list).

  • Many, many bug fixes and new features - memory-leak fixes, better memory efficiency by avoiding unnecessary copies, etc. Check the NEWS file.

  • Faster subsetting (using a native implementation), faster binary search (hence faster joins and subsets) and, more recently, automatic indexing, etc.
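Not from their article, just a minimal sketch of what the grouping and binary-search points above look like in practice (assumes data.table >= 1.9.2; the table and column names here are made up):

library(data.table)

set.seed(1L)
DT <- data.table(id  = sample(1e4L, 1e7L, TRUE),
                 val = rnorm(1e7L))

# GForce: simple j-expressions like mean()/sum() per group are computed
# internally in C, skipping per-group evaluation of R code.
agg <- DT[, .(avg = mean(val), total = sum(val)), by = id]

# Keyed subset/join: setkey() orders the table once via the radix sort,
# after which lookups use binary search rather than a full vector scan.
setkey(DT, id)
sub <- DT[.(42L)]   # binary-search subset on the key column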

Also, it is not clear what compiler optimisations they used.

To give an idea of the speedup since 1.8.10, have a look at this recent benchmark by Matt.

# data.table 1.9.2     50GB    10,000 groups    < 1 minute     (from Matt's benchmark)
# data.table 1.8.10    10GB       500 groups    ~18 minutes!   (from their article)

Grouping 50GB of data into 10,000 groups with data.table takes less than a minute (on a 2.5GHz processor; see detailed specs in the link), whereas aggregating 10GB of data with just 500 groups took approximately 18 minutes in their benchmark (on a 3.4GHz processor).
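For a rough idea of what such a timing looks like, here's my own scaled-down sketch (not Matt's actual benchmark script; the sizes and names are made up):

library(data.table)

N  <- 5e7L                                  # ~600MB of data; scale up as RAM allows
DT <- data.table(grp = sample(1e4L, N, TRUE),
                 x   = runif(N))

system.time(DT[, .(s = sum(x)), by = grp])  # ad hoc grouping (GForce path)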

They don't mention the cache sizes of their machine, the data dimensions, or how many columns are grouped by, etc. (or I may have missed it in the text).

And there have already been some performance fixes since then. Projections will get even faster once this FR is taken care of (see the sketch below). It'd be interesting to rerun this benchmark (and maybe add more tests), although I can't seem to find a link to the source code in their article.
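For reference, a projection in data.table terms is just a column subset; a toy example:

library(data.table)
DT <- data.table(a = 1:5, b = letters[1:5], c = runif(5))

DT[, .(a, c)]                     # project columns a and c
DT[, c("a", "c"), with = FALSE]   # equivalent, character form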

But again, a very good read.

Upvotes: 3
