I have 2 questions:
Is hash faster than data.table for Big Data?
I looked at the vignettes of the related packages and Googled some potential solutions, but I'm still not sure about the answers to the questions above.
Considering the following post,
R fast single item lookup from list vs data.table vs hash
it seems that a single lookup in a data.table object is actually quite slow, even slower than in a list in base R.
However, a lookup using a hash object from the hash package is very speedy according to this benchmark. Is that accurate?
However, it looks like a hash object handles only unique keys.
In the following example, only 2 (key, value) pairs are created:
> library(hash)
> h <- hash(c("A","B","A"),c(1,2,3))
> h
<hash> containing 2 key-value pair(s).
A : 3
B : 2
So, if I have a table of (key, value) pairs where a key can have several different values, and I want to do a quick lookup of all the values corresponding to a key, what is the best object or data structure in R for that?
Can we still use the hash object, or is data.table the most appropriate in this case?
Let's say we are dealing with very large tables; otherwise this discussion is irrelevant.
Related link: http://www.r-bloggers.com/hash-table-performance-in-r-part-i/
Upvotes: 2
Views: 1252
The benchmark you're citing comes from the question in that SO post, not from its answer.
As you'll see in the answer there, the results you get from the benchmark can change a lot depending on how you use data.table, or any given big data package.
You are correct that with the simplest use of hash() each key has 1 value. There are, of course, work-arounds for this. One would be to make the value a string and pack your multiple values into that string:
h <- hash(c("Key 1","Key 2","Key 3"),c("1","2","1 and 2"))
h
<hash> containing 3 key-value pair(s).
Key 1 : 1
Key 2 : 2
Key 3 : 1 and 2
Another may be to use a hash table via a hashed environment in R, or perhaps via hashmap().
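As a rough sketch of the hashed-environment idea, using only base R: group the values by key with split(), then load the groups into an environment created with new.env(hash = TRUE). The names keys, values, grouped, and lookup are made up for illustration and are not from the original question.

```r
# Illustrative data: key "A" appears twice with different values
keys   <- c("A", "B", "A")
values <- c(1, 2, 3)

# split() groups the values by key into a named list of vectors
grouped <- split(values, keys)

# list2env() copies that list into a hashed environment,
# so each key maps to a vector holding all of its values
lookup <- list2env(grouped, envir = new.env(hash = TRUE))

a_values <- lookup[["A"]]   # every value stored under "A"
b_values <- lookup[["B"]]   # the single value stored under "B"

# Check for a key that is absent without searching enclosing environments
has_c <- exists("C", envir = lookup, inherits = FALSE)
```

Unlike the string work-around, each value stays numeric, so nothing has to be parsed back out after the lookup.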
I do not know of a single, definitive proof that hash or data.table will always be faster. It can vary with your use case, your data, and how you implement them in your code.
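Since the outcome depends on the data and the implementation, one option is to time both approaches on something shaped like your own table. The sketch below is only a rough harness, not a definitive benchmark: it assumes the data.table package is installed, uses a base R hashed environment as a stand-in for a hash table, and all sizes and names (n, keys, vals, probe) are invented for illustration.

```r
library(data.table)

set.seed(1)
n    <- 1e5
# About 10,000 distinct keys, so most keys carry several values
keys <- sprintf("key%06d", sample.int(n / 10, n, replace = TRUE))
vals <- runif(n)

# Hashed environment: one numeric vector per key
env <- list2env(split(vals, keys), envir = new.env(hash = TRUE))

# data.table keyed on k, so lookups can use binary search
dt <- data.table(k = keys, v = vals, key = "k")

# Keys to look up one at a time
probe <- sample(unique(keys), 1000)

t_env <- system.time(for (p in probe) env[[p]])
t_dt  <- system.time(for (p in probe) dt[.(p), v])

# Elapsed seconds for each approach on this machine and this data
elapsed <- c(environment = t_env[["elapsed"]], data.table = t_dt[["elapsed"]])
elapsed
```

The loop issues one lookup per call, which is the pattern from the linked post; passing all probe keys to data.table in a single call is a different usage and may time very differently.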
In general, I'd say that data.table might be the more common solution if your use case does not involve a true key-value pair, since no work-around would then be needed to handle multiple values per key.
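To illustrate the multiple-values-per-key case in data.table, here is a minimal sketch that assumes the data.table package is installed. The column names k and v are hypothetical; the data mirrors the A/B/A example from the question.

```r
library(data.table)

# Same data as the hash example: key "A" has two values
dt <- data.table(k = c("A", "B", "A"), v = c(1, 2, 3))

# setkey() sorts the table by k and marks k as the key,
# so the lookups below can use binary search
setkey(dt, k)

# All values for key "A"; both rows are kept
vals_a <- dt[.("A"), v]

# A key that is not present: nomatch = NULL drops it instead of returning NA
vals_c <- dt[.("C"), v, nomatch = NULL]

vals_a
vals_c
```

Here the table itself holds duplicate keys, so nothing is overwritten the way it is when hash() sees the same key twice.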
Upvotes: 1