Reputation: 459

Programmatically factorize selected columns in data frame, the tidy way?

Here is a simplified example:

library(tidyverse)

frame <- tribble(
  ~a, ~b, ~c,
   1,  1,  2,
   5,  4,  7,
   2,  3,  4, 
   3,  1,  6
)

key <- tribble(
  ~col, ~name, ~type, ~labels,
     1,   "a",   "f",     c("one", "two", "three", "four", "five"),
     2,   "b",   "f",     c("uno", "dos", "tres", "cuatro"),
     3,   "c",   "f",     1:7
)

Is there an elegant way of programmatically sweeping across the columns in frame and applying the specific factor class, based on the parameters in key? The expected result would be:

# A tibble: 4 x 3
       a      b      c
  <fctr> <fctr> <fctr>
1    one    uno      2
2   five cuatro      7
3    two   tres      4
4  three    uno      6

The best solution I have so far uses purrr's map2(), but it relies on assignment, which IMO is not the most elegant:

frame[key$col] <- map2(key$col, key$labels, 
        function(x, y) factor(frame[[x]], levels = 1:length(y), labels = y))

Does anyone have a tidier solution? Note that my original data frame has hundreds of columns, and I need to re-factor most of them with different levels/labels, so the process has to be automated.

Upvotes: 2

Answers (4)

LVG77

Reputation: 206

Here is another solution. I am not sure how "elegant" it is. Hopefully, someone can improve on that.

suppressPackageStartupMessages(library(tidyverse))

frame <- tribble(
  ~a, ~b, ~c,
  1,  1,  2,
  5,  4,  7,
  2,  3,  4, 
  3,  1,  6
)

key <- tribble(
  ~col, ~name, ~type, ~labels,
  1,   "a",   "f",     c("one", "two", "three", "four", "five"),
  2,   "b",   "f",     c("uno", "dos", "tres", "cuatro"),
  3,   "c",   "f",     1:7
)

colnames(frame) %>% 
  map(~ {
    factor(pull(frame, .x),
           levels = 1:length(pluck(key[key$name == .x, "labels"], 1, 1)),
           labels = pluck(key[key$name == .x, "labels"], 1, 1))
  }) %>% 
  set_names(colnames(frame)) %>% 
  as_tibble()
#> # A tibble: 4 x 3
#>        a      b      c
#>   <fctr> <fctr> <fctr>
#> 1    one    uno      2
#> 2   five cuatro      7
#> 3    two   tres      4
#> 4  three    uno      6
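The repeated pluck() lookups above can be avoided with a named lookup list and purrr's imap(), which passes each column together with its name. This is only a sketch: `lookup` is a hypothetical helper object, not something from the question, and it assumes every column of frame has a matching row in key.

```r
library(tibble)
library(purrr)

frame <- tribble(
  ~a, ~b, ~c,
   1,  1,  2,
   5,  4,  7,
   2,  3,  4,
   3,  1,  6
)

key <- tribble(
  ~name, ~labels,
    "a", c("one", "two", "three", "four", "five"),
    "b", c("uno", "dos", "tres", "cuatro"),
    "c", 1:7
)

# Hypothetical helper: the label vectors, indexed by column name
lookup <- set_names(key$labels, key$name)

# imap() calls the function with each column (x) and its name (nm)
result <- as_tibble(imap(frame, function(x, nm) {
  labs <- lookup[[nm]]
  factor(x, levels = seq_along(labs), labels = labs)
}))

result
```

Because the lookup is by name, the order of rows in key does not have to match the order of columns in frame.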

Upvotes: 1

Onyambu

Reputation: 79288

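For this question, you can use base R. Note that on R 4.0.0 or later data.frame() no longer converts strings to factors by default, so stringsAsFactors = TRUE is passed explicitly:

(A=`names<-`(data.frame(mapply(function(x,y)x[y],key$labels,frame),stringsAsFactors=TRUE),key$name))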
      a      b c
1   one    uno 2
2  five cuatro 7
3   two   tres 4
4 three    uno 6

sapply(A,class)
   a        b        c 
"factor" "factor" "factor"
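One caveat with the one-liner above: where it does yield factors, they are built from the strings that actually occur in each column, so the levels come out in alphabetical order and labels that never appear (such as "four") are not levels at all. If the full, ordered label set matters, a base R sketch with Map() might look like the following. The named list `labs_by_col` is a hypothetical stand-in for key$labels named by key$name.

```r
frame <- data.frame(a = c(1, 5, 2, 3), b = c(1, 4, 3, 1), c = c(2, 7, 4, 6))

# Hypothetical stand-in for key$labels, named by key$name
labs_by_col <- list(
  a = c("one", "two", "three", "four", "five"),
  b = c("uno", "dos", "tres", "cuatro"),
  c = 1:7
)

# Map() walks the selected columns and their label vectors in parallel
frame[names(labs_by_col)] <- Map(
  function(x, labs) factor(x, levels = seq_along(labs), labels = labs),
  frame[names(labs_by_col)],
  labs_by_col
)

levels(frame$a)
#> [1] "one"   "two"   "three" "four"  "five"
```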

Upvotes: 0

markdly

Reputation: 4534

I'm interested to see what other solutions are proposed for this. My only suggestion is to change the proposed solution slightly so that frame is passed to the function explicitly rather than referenced from inside its body, which makes it clearer that frame is what gets modified.

For example, pass frame as an additional argument in the call to map2:

frame[key$col] <- map2(key$col, key$labels, 
                       function(x, y, z) factor(z[[x]], levels = 1:length(y), labels = y), 
                       frame)

Or do the same thing using the pipe operator %>%:

frame[key$col] <- frame %>%
  { map2(key$col, key$labels, 
         function(x, y, z) factor(z[[x]], levels = 1:length(y), labels = y), .) }

Upvotes: 0

David

Reputation: 10222

I don't know if this answer satisfies your requirements of being tidy as it uses a plain old for-loop. But it does the job and in my opinion is easy to read/understand as well as reasonably fast.

library(tidyverse)
frame <- tribble(
 ~a, ~b, ~c,
 1,  1,  2,
 5,  4,  7,
 2,  3,  4, 
 3,  1,  6
)

key <- tribble(
 ~col, ~name, ~type, ~labels,
 1,   "a",   "f",     c("one", "two", "three", "four", "five"),
 2,   "b",   "f",     c("uno", "dos", "tres", "cuatro"),
 3,   "c",   "f",     1:7
)

for (i in 1:nrow(key)) {
 var <- key$name[[i]]
 x <- frame[[var]]
 labs <- key$labels[[i]]
 lvls <- seq_along(labs) # one level per label; factor() requires levels and labels of equal length

 frame <- frame %>% mutate(!! var := factor(x, levels = lvls, labels = labs))
}

frame
#> # A tibble: 4 x 3
#>        a      b      c
#>   <fctr> <fctr> <fctr>
#> 1    one    uno      2
#> 2   five cuatro      7
#> 3    two   tres      4
#> 4  three    uno      6

The typical tidy approach would be to reshape the data so that all variables sit in one column, apply a function to that column, and finally reshape it back to the original format. However, that does not work well for factors, because each column here needs its own set of levels and labels, so we need other means. Are factors even considered tidy?
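For comparison with the loop above, the same per-column conversion can be written as a single mutate() without reshaping, using across() and cur_column(). This is a sketch that assumes dplyr 1.0.0 or later, where both functions are available. `lookup` is a hypothetical helper object introduced only for this example.

```r
library(dplyr)
library(tibble)

frame <- tribble(
  ~a, ~b, ~c,
   1,  1,  2,
   5,  4,  7,
   2,  3,  4,
   3,  1,  6
)

key <- tribble(
  ~name, ~labels,
    "a", c("one", "two", "three", "four", "five"),
    "b", c("uno", "dos", "tres", "cuatro"),
    "c", 1:7
)

# Hypothetical helper: the label vectors, indexed by column name
lookup <- setNames(key$labels, key$name)

# cur_column() gives the name of the column across() is currently processing
result <- frame %>%
  mutate(across(
    all_of(key$name),
    ~ factor(.x,
             levels = seq_along(lookup[[cur_column()]]),
             labels = lookup[[cur_column()]])
  ))

result
```

Only the columns named in key$name are touched, so any other columns in a wider data frame would pass through unchanged.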

Edit

I assumed the for-loop would perform about as well as the map2() version. I was wrong: in the benchmark below it is roughly 28 times slower at the median.

Here are some benchmarks:

library(microbenchmark)

frame1 <- frame
frame2 <- frame

microbenchmark(
 map2 = {
  frame1[key$col] <- map2(key$col, key$labels, 
                          function(x, y) factor(frame[[x]], 
                                                levels = 1:max(frame[[x]],
                                                               length(y)), 
                                                labels = y))
 },
 forloop = {
  for (i in 1:nrow(key)) {
   var <- key$name[[i]]
   x <- frame2[[var]]
   labs <- key$labels[[i]]
   lvls <- 1:max(length(x), length(labs))
   frame2 <- frame2 %>% mutate(!! var := factor(x, levels = lvls, labels = labs))
  }
 }
)

# Unit: microseconds
# expr         min         lq       mean    median         uq       max neval cld
# map2      375.53   416.5805   514.3126   450.825   484.2175  3601.636   100  a 
# forloop 11407.80 12110.0090 12816.6606 12564.176 13425.6840 16632.682   100   b

Upvotes: 0
