Fagui Curtain

Reputation: 1917

data.table and hash -- speed and flexibility to handle multiple values per key

I have 2 questions:

  1. Is hash faster than data.table for Big Data?
  2. How can I deal with multiple values per key, if I want to use a hash-based approach?

I looked at the vignette of the related packages and Googled some potential solutions, but I'm still not sure about the answers to the questions above.

Considering the following post,

R fast single item lookup from list vs data.table vs hash

it seems that a single-item lookup in a data.table object is actually quite slow, even slower than in a base-R list.

A lookup in a hash object from the hash package, on the other hand, is very fast according to that benchmark -- is that accurate?

It also looks like a hash object handles only unique keys: in the following, only 2 (key, value) pairs are created.

> library(hash)
> h <- hash(c("A","B","A"), c(1,2,3))
> h
<hash> containing 2 key-value pair(s).
  A : 3
  B : 2

So, if I have a table of (key, value) pairs where a key can have multiple values, and I want a fast lookup of the values for a given key, what is the best object/data structure in R for that?

Can we still use a hash object, or is data.table the more appropriate choice in this case?

Let's assume we are dealing with very large tables; otherwise this discussion is irrelevant.
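To make the multiple-values-per-key case concrete, here is a minimal sketch of a keyed data.table lookup (the column names `key` and `value` and the toy data are made up for illustration):

```r
library(data.table)

# Toy table where key "A" maps to two values
DT <- data.table(key = c("A", "B", "A"), value = c(1, 2, 3))
setkey(DT, key)      # sort and index by key, enabling binary-search lookups

DT["A"]              # returns both rows whose key is "A"
DT["A", value]       # just the values for that key
```

Unlike a hash, the keyed table keeps every row, so a lookup on "A" naturally returns all of its values.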

Related link: http://www.r-bloggers.com/hash-table-performance-in-r-part-i/

Upvotes: 2

Views: 1252

Answers (1)

Hack-R

Reputation: 23211

You're referring to the question in that SO post, not the answer.

As you'll see in the answer there, benchmark results can change a lot depending on how you use data.table, or any given big-data package.

You are correct that with the simplest use of hash(), each key has one value. There are, of course, workarounds for this. One would be to make the value a string and append your multiple values to it:

h <- hash(c("Key 1","Key 2","Key 3"),c("1","2","1 and 2"))
h
<hash> containing 3 key-value pair(s).
  Key 1 : 1
  Key 2 : 2
  Key 3 : 1 and 2
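The string-append workaround loses the numeric type. Another option is to store a vector under each key; this is a sketch relying only on the fact that the hash package accepts arbitrary R objects as values (the key names are illustrative):

```r
library(hash)

# Store a vector per key instead of pasting values into one string
h <- hash()
h[["Key 1"]] <- 1
h[["Key 2"]] <- 2
h[["Key 3"]] <- c(1, 2)   # two values under one key

h[["Key 3"]]              # retrieves the numeric vector c(1, 2)
```

The values stay numeric, so no parsing is needed on retrieval.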

Another would be to use a hash table via a hashed environment in R, or perhaps via hashmap().
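For example, a hashed environment in base R also places no restriction on what a value can be, so a key can map to a vector of values (a sketch; the key names are illustrative):

```r
# Base-R hashed environment; each value can be any R object,
# including a vector holding multiple values for one key
e <- new.env(hash = TRUE)

assign("A", c(1, 3), envir = e)   # key "A" -> two values
assign("B", 2,       envir = e)

get("A", envir = e)               # the vector c(1, 3)
```

This needs no extra packages, which can matter if you want to avoid dependencies.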

I do not know of a single, definitive proof that hash or data.table will always be faster. It will vary with your use case, your data, and how you implement them in your code.

In general, I'd say that data.table might be the more common solution if your use case does not involve a true key-value pairing, and with it no workaround is needed for multiple values per key.
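If you do want a single row per key in data.table, a list column collapses the multiple values without any string workaround (a sketch using the same toy data as above):

```r
library(data.table)

DT <- data.table(key = c("A", "B", "A"), value = c(1, 2, 3))

# Collapse to one row per key; 'values' becomes a list column
agg <- DT[, .(values = list(value)), by = key]

agg[key == "A", values][[1]]   # the numeric vector of values for "A"
```

Each cell of the list column holds a full numeric vector, so the type of the original values is preserved.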

Upvotes: 1
