Reputation: 1462
I want to convert my string ID column into a numeric ID column.
To do this, I'm using two hash functions: digest() and digest2int().
digest() from library(digest) requires an R object as input.
When digest() is called inside dplyr::mutate, it is not applied to each value of the Species column; instead it returns a single value that is repeated across all Species values.
library(dplyr)
library(digest)
chk <- as_tibble(iris) %>% select(Species) %>% unique() %>%
  mutate(digest_Species = digest(Species, algo = 'md5')) %>%
  mutate(digest2int_Species = digest2int(digest_Species))
chk
# A tibble: 3 x 3
  Species    digest_Species                   digest2int_Species
  <fct>      <chr>                                         <int>
1 setosa     7d4660364a758d3a706737e96be30a54         1339653983
2 versicolor 7d4660364a758d3a706737e96be30a54         1339653983
3 virginica  7d4660364a758d3a706737e96be30a54         1339653983
The expectation is that each Species value gets its own digest() output, and therefore its own digest2int() value.
# When applied individually, each Species value gets a unique result
> 'setosa' |> digest(algo = 'md5') |> digest2int()
[1] -1179682887
> 'versicolor' |> digest(algo = 'md5') |> digest2int()
[1] 1134371908
> 'virginica' |> digest(algo = 'md5') |> digest2int()
[1] 1387061319
I still need digest() because I think it will further reduce possible collisions.
There are some cases in my dataset where digest2int() gave similar output for two completely different strings.
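One way to confirm whether such collisions actually occur is to hash the distinct strings and look for repeated integers. The following is only a sketch: find_collisions and its hash_fun argument are hypothetical names introduced here for illustration, not part of the digest package.

```r
library(digest)

# Hypothetical helper: group the distinct input strings that share an
# integer hash with at least one other distinct string.
# hash_fun is assumed to map a single string to a single integer.
find_collisions <- function(ids, hash_fun = digest2int) {
  ids <- unique(as.character(ids))
  # Hash one string at a time so nothing depends on hash_fun being vectorized
  hashed <- vapply(ids, hash_fun, integer(1), USE.NAMES = FALSE)
  dup <- hashed %in% hashed[duplicated(hashed)]
  split(ids[dup], hashed[dup])
}

collisions <- find_collisions(iris$Species)
length(collisions)  # 0 means every distinct string got its own integer
```

Each element of the returned list holds the strings that collided on one integer, so the same check can be run with and without the intermediate digest() step to compare the two approaches.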
Upvotes: 2
Views: 809
Reputation: 887691
We could do this in base R:
library(digest)
# Hash each level separately, then bind the one-row data frames together
do.call(rbind, lapply(levels(iris$Species), \(x) {
  digest_Species <- digest(x, algo = 'md5')
  digest2int_Species <- digest2int(digest_Species)
  data.frame(Species = x, digest_Species, digest2int_Species)
}))
-output
     Species                   digest_Species digest2int_Species
1     setosa 946a2c38121bed59091a362f5015327e        -1179682887
2 versicolor fa66f5fefadcc79a57a5afe78fe680db         1134371908
3  virginica b313f00809c319b9b5918795d13ca47a         1387061319
Upvotes: 2
Reputation: 389175
digest is not vectorized: it hashes the whole object it is given, so inside mutate it hashes the entire column at once. You need to apply it to each value separately, which can be achieved with rowwise.
library(digest)
library(dplyr)
library(purrr)
as_tibble(iris) %>%
  select(Species) %>%
  unique() %>%
  rowwise() %>%
  mutate(digest_Species = digest(Species, algo = 'md5')) %>%
  mutate(digest2int_Species = digest2int(digest_Species))
Or using map functions:
as_tibble(iris) %>%
  select(Species) %>%
  unique() %>%
  mutate(digest_Species = map_chr(Species, digest, algo = 'md5'),
         digest2int_Species = map_int(digest_Species, digest2int))
#  Species    digest_Species                   digest2int_Species
#  <fct>      <chr>                                         <int>
#1 setosa     f5a187dcc8d8e2f7ec01ac6ed18ea806         -998405698
#2 versicolor 4ef278aeef19142e65778e37cc22f74a        -1359120580
#3 virginica  6effd04f2dd243f634073b96ec64f149          662748619
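Note that these hashes differ from the ones obtained by hashing the plain strings 'setosa', 'versicolor' and 'virginica'. This appears to be because Species is a factor, so each element passed to digest still carries its levels and class. A minimal sketch that converts to character first is shown below; hash_ids is a hypothetical helper name, and it is assumed that hashing plain strings is what is wanted.

```r
library(digest)

# Hypothetical helper: hash every element of a vector on its own.
hash_ids <- function(x, algo = "md5") {
  x <- as.character(x)  # drop factor levels so each element is a plain string
  # digest() hashes one object per call, so call it once per element
  hex <- vapply(x, digest, character(1), algo = algo, USE.NAMES = FALSE)
  int <- vapply(hex, digest2int, integer(1), USE.NAMES = FALSE)
  data.frame(id = x, digest = hex, digest2int = int, stringsAsFactors = FALSE)
}

res <- hash_ids(unique(iris$Species))
print(res)
```

With the conversion in place, each row should agree with calling digest() and digest2int() on the corresponding string individually.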
Upvotes: 2