Afiq Johari

Reputation: 1462

R dplyr mutate with hash function (digest) requiring R object as input

I want to convert my string ID column into a numeric ID column.

For this, I'm using two hash functions from the digest package: digest() and digest2int().

digest() from library(digest) requires an R object as input.

When I call digest() inside dplyr::mutate(), it is not applied to each value of the Species column. Instead, it returns a single value that is repeated for every Species value.

library(dplyr)
library(digest)
chk <- as_tibble(iris) %>% select(Species) %>% unique() %>%
  mutate(digest_Species = digest(Species, algo = 'md5')) %>%
  mutate(digest2int_Species = digest2int(digest_Species))

chk
# A tibble: 3 x 3
  Species    digest_Species                   digest2int_Species
  <fct>      <chr>                                         <int>
1 setosa     7d4660364a758d3a706737e96be30a54         1339653983
2 versicolor 7d4660364a758d3a706737e96be30a54         1339653983
3 virginica  7d4660364a758d3a706737e96be30a54         1339653983

I expect a distinct digest() output for each Species value, and therefore a distinct digest2int() value for each Species value as well.

# When applied individually, we get a unique value for each Species

> 'setosa' |> digest(algo = 'md5') |> digest2int()
[1] -1179682887
> 'versicolor' |> digest(algo = 'md5') |> digest2int()
[1] 1134371908
> 'virginica' |> digest(algo = 'md5') |> digest2int()
[1] 1387061319

I still need digest() because I think it further reduces the chance of collisions. There are some cases in my dataset where digest2int() gave the same output for two completely different strings.

Upvotes: 2

Views: 809

Answers (2)

akrun

Reputation: 887691

We could do this in base R by hashing each level of Species separately:

library(digest)
do.call(rbind, lapply(levels(iris$Species), \(x) {
  digest_Species <- digest(x, algo = 'md5')
  digest2int_Species <- digest2int(digest_Species)

  data.frame(Species = x, digest_Species, digest2int_Species)
}))

-output

     Species                   digest_Species digest2int_Species
1     setosa 946a2c38121bed59091a362f5015327e        -1179682887
2 versicolor fa66f5fefadcc79a57a5afe78fe680db         1134371908
3  virginica b313f00809c319b9b5918795d13ca47a         1387061319
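
The same per-element idea can also be written with vapply(), which avoids building one data frame per level and binding them afterwards. This is only a sketch of an alternative, not part of the original answer; the variable names species, md5 and ints are made up for illustration.

```r
library(digest)

# Character levels of the factor: "setosa", "versicolor", "virginica"
species <- levels(iris$Species)

# vapply() calls digest() once per element, so each string gets its own hash
md5 <- vapply(species, digest, character(1), algo = "md5", USE.NAMES = FALSE)

# Convert each hex digest string to an integer
ints <- vapply(md5, digest2int, integer(1), USE.NAMES = FALSE)

result <- data.frame(Species = species,
                     digest_Species = md5,
                     digest2int_Species = ints)
result
```

Because each element passed to digest() is a plain character string, this should hash the same inputs as the lapply() version above.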

Upvotes: 2

Ronak Shah

Reputation: 389175

digest() is not vectorized: it hashes the whole object it is given, so a full column produces a single hash. You need to apply it to each value separately, which can be achieved with rowwise().

library(digest)
library(dplyr)
library(purrr)

as_tibble(iris) %>% 
  select(Species) %>% 
  unique() %>%
  rowwise() %>%
  mutate(digest_Species = digest(Species, algo = 'md5')) %>%
  mutate(digest2int_Species = digest2int(digest_Species))

Or using map functions -

as_tibble(iris) %>% 
  select(Species) %>% 
  unique() %>%
  mutate(digest_Species = map_chr(Species, digest, algo = 'md5'),
         digest2int_Species = map_int(digest_Species, digest2int))


#  Species    digest_Species                   digest2int_Species
#  <fct>      <chr>                                         <int>
#1 setosa     f5a187dcc8d8e2f7ec01ac6ed18ea806         -998405698
#2 versicolor 4ef278aeef19142e65778e37cc22f74a        -1359120580
#3 virginica  6effd04f2dd243f634073b96ec64f149          662748619
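
Note that these hashes differ from the ones the question gets for the bare strings (for example f5a187... here versus 946a2c... for 'setosa'). The likely reason is that Species is a factor, and digest() by default serializes the whole R object, attributes included, so a one-element factor does not hash like the plain string. A minimal sketch, assuming the goal is to reproduce the per-string values from the question, is to convert to character first; the name chk is reused from the question.

```r
library(digest)
library(dplyr)
library(purrr)

chk <- as_tibble(iris) %>%
  distinct(Species) %>%
  mutate(
    # as.character() drops the factor attributes, so each bare string is hashed
    digest_Species = map_chr(as.character(Species), digest, algo = "md5"),
    # one integer per digest string
    digest2int_Species = map_int(digest_Species, digest2int)
  )

chk
```

Whether factor-based or string-based hashes are preferable depends on the use case; the important part is to be consistent, since the two are not interchangeable.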

Upvotes: 2
