xiaodai

Reputation: 16064

R: Fast hashing of strings to integer modulo n?

I have a vector of strings and I would like to hash each element individually to integers modulo n.

This SO post suggests an approach using digest and strtoi, but when I try it I get NA as the returned value:

library(digest)
strtoi(digest("cc", algo = "xxhash32"), 16L)

So this approach does not work: it cannot even produce an integer, let alone one reduced modulo n.

What's the best way to hash a large vector of strings to integers modulo n for some n? Efficient solutions are more than welcome as the vector is large.

Upvotes: 3

Views: 809

Answers (2)

xiaodai

Reputation: 16064

I made an Rcpp implementation using code from this SO post, and the resulting code is quite fast even for large-ish string vectors.

To use it:

if(!require(disk.frame)) devtools::install_github("xiaodaigh/disk.frame")
modn = 17
disk.frame::hashstr2i(c("string1","string2"), modn)

Upvotes: 1

Birger

Reputation: 1141

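R uses signed 32-bit integers for integer vectors, so the largest representable integer is 2^31 - 1 = 2147483647 (about +/-2.1*10^9). An xxhash32 digest is an unsigned 32-bit value and can exceed that limit, in which case strtoi returns NA because the number is too big.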
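If the goal is only the hash modulo n, one way to sidestep the overflow in base R is to parse the hex digest as a double instead of an integer, since doubles represent every unsigned 32-bit value exactly. The following is a minimal sketch, not a tuned implementation: the helper name hash_mod_n is hypothetical, and serialize = FALSE is an assumption made so that the raw string is hashed rather than its serialized R object.

```r
library(digest)

# Hypothetical helper (not from the original posts): hash each string with
# xxhash32 and reduce it modulo n without going through 32-bit integers.
hash_mod_n <- function(x, n) {
  # digest() hashes one object at a time, so apply it element by element.
  # serialize = FALSE (an assumption here) hashes the raw string itself.
  hex <- vapply(x, digest, character(1), algo = "xxhash32",
                serialize = FALSE, USE.NAMES = FALSE)
  # as.numeric() accepts "0x..." hexadecimal strings and returns doubles,
  # which hold all values below 2^32 exactly.
  as.numeric(paste0("0x", hex)) %% n
}

hash_mod_n(c("cc", "string1", "string2"), 17)
```

The result is a numeric vector with values in 0 to n - 1. Because digest is called once per element, this is likely to be slower on very large vectors than a compiled approach.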

The mpfr-function from the Rmpfr package should work for you:

library(Rmpfr)
mpfr(x = digest("cc", algo = "xxhash32"), base = 16)
[1] 4192999065

Upvotes: 2
