Reputation: 16064
I have a vector of strings and I would like to hash each element individually to integers modulo n.
This SO post suggests an approach using digest and strtoi, but when I try it I get NA as the returned value:
library(digest)
strtoi(digest("cc", algo = "xxhash32"), 16L)
So this approach does not work: it cannot even produce an integer, let alone one modulo n.
What's the best way to hash a large vector of strings to integers modulo n for some n? Efficient solutions are more than welcome as the vector is large.
Upvotes: 3
Views: 809
Reputation: 16064
I wrote an Rcpp implementation using code from this SO post, and the resulting code is quite fast even for large-ish string vectors.
To use it:
if(!require(disk.frame)) devtools::install_github("xiaodaigh/disk.frame")
modn = 17
disk.frame::hashstr2i(c("string1","string2"), modn)
Upvotes: 1
Reputation: 1141
R uses 32-bit signed integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9 (the maximum is 2^31 - 1 = 2147483647). strtoi returns NA because the hash value is too big to fit in that range.
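The limit can be checked directly in base R. The snippet below is only an illustrative sketch; the hex strings are made-up inputs chosen to sit at the boundary, not actual hash outputs.

```r
# Largest value an R integer vector can hold: 2^31 - 1
int_max <- .Machine$integer.max

# The largest hex value that still fits in a signed 32-bit integer
fits <- strtoi("7fffffff", 16L)

# A 32-bit hash can be as large as 0xffffffff, which exceeds int_max;
# this is the situation in which the question reports NA
too_big <- strtoi("ffffffff", 16L)

print(int_max)
print(fits)
print(too_big)
```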
The mpfr function from the Rmpfr package should work for you:
library(Rmpfr)
mpfr(x = digest("cc", algo = "xxhash32"), base = 16)
[1] 4192999065
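If the goal is only the hash modulo n, one possible alternative (not part of the answer above) is to convert the hex digest to a double, which represents every 32-bit value exactly, and take the remainder. This is a minimal sketch: hash_mod_n is a hypothetical helper name, and serialize = FALSE is an assumption made here so that the raw string is hashed rather than its serialized form, which means the values differ from the calls shown earlier. digest is called once per element via vapply, so this is simple rather than fast.

```r
library(digest)

# Hypothetical helper: map each string to a value in 0..(n - 1).
hash_mod_n <- function(x, n) {
  # One xxhash32 hex digest per element of the character vector
  hex <- vapply(x, digest, character(1),
                algo = "xxhash32", serialize = FALSE,
                USE.NAMES = FALSE)
  # as.numeric() parses "0x..." hex strings into doubles, which hold
  # all 32-bit values exactly, so no NA from integer overflow
  as.numeric(paste0("0x", hex)) %% n
}

res <- hash_mod_n(c("string1", "string2", "cc"), 17)
print(res)
```

The result is a double vector; wrap it in as.integer() if an integer type is needed, which is safe whenever n is at most 2^31 - 1.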
Upvotes: 2