Sam
Sam

Reputation: 381

Convert numeric columns to factors with different labels using key

I have a data set that has 900 columns of numeric data and I need to convert the numeric columns to factors that have labels. Many labels will repeat. I am trying to write a function that will take the numeric columns, identify the type of label that the column needs, and then apply that label.

Here is an example data frame:

#create data frame with columns a,b,c,d
a<-c(1,2,3,4,5)
b<-c(0,1,0,1,0)
c<-c(1,0,1,0,1)
d<-c(2,3,4,5,3)

x<-as.data.frame(cbind(a,b,c,d))

I have a separate dataframe (i.e. y) that includes a key (i.e. column e) that identifies which factor labels should be applied to which of the columns (i.e. column f). Notice that b and c should have the same label.

e<-c(1,2,2,3)
f<-c("a","b","c","d")

y<-as.data.frame(cbind(e,f))

I would like to write a function that does the following, but automated. Here are the example labels that I would like to apply to a,b,c,d--where a and d are different, but b and c are the same.

x$a<-factor(x$a,
    levels=c(1,2,3,4,5),
    labels=c("Less than 25%",
    "25-50%",
    "51-75%",
    "76-90%",
    "More than 90%"))

x$b<-factor(x$b,
    levels=c(0,1),
    labels=c("Yes","No"))

x$c<-factor(x$c,
    levels=c(0,1),
    labels=c("Yes","No"))

x$d<-factor(x$c,
    levels=c(1,2,3,4,5),
    labels=c("l","m","n","o","p"))

With the final data set looking like:

>x
              a   b   c d
1 Less than 25% Yes  No m
2        25-50%  No Yes n
3        51-75% Yes  No o
4        76-90%  No Yes p
5 More than 90% Yes  No n

In the actual data set, there will be close to 60 labels.

Upvotes: 2

Views: 3020

Answers (1)

aosmith
aosmith

Reputation: 36076

If you can get the labels and levels appropriately associated with the e column, which is your link to the columns of the dataset, you can do this via purrr:pmap_df.

Here's how that would look. Most of the work is in getting the labels and levels as a list column, which I do via tibble (loaded with dplyr).

Starting with your second dataset, y, which is an important part of this.

e = c(1,2,2,3)
f =  names(x)

y = data.frame(e,f)

  e f
1 1 a
2 2 b
3 2 c
4 3 d

Make sure the levels and labels are available and can be associated with your e vector. If they are in a long format, you could get them into a list-column format via tidyr::nest. I found this to be the most time-consuming step in terms of getting this info written out.

library(dplyr)

levels.labels = tibble(e = c(1, 2, 3),
             levels = list(1:5, 0:1, 1:5),
             labels = list(c("Less than 25%",
                             "25-50%",
                             "51-75%",
                             "76-90%",
                             "More than 90%"),
                           c("Yes","No"),
                           c("l","m","n","o","p")))

If you needed to write your levels and labels out within R you might want to try tribble, which is available in the development version of the tibble package.

library(tibble)
levels.labels  = tribble(~e, ~levels, ~labels,
      1, 1:5, c("Less than 25%",
               "25-50%",
               "51-75%",
               "76-90%",
               "More than 90%"),
      2, 0:1, c("Yes","No"),
      3, 1:5, c("l","m","n","o","p"))

Merge the levels and labels with your y dataset based on e. The rows of the result is a 1 to 1 match of the columns of x.

key = left_join(y, levels.labels)

  e f        levels                                               labels
1 1 a 1, 2, 3, 4, 5 Less than 25%, 25-50%, 51-75%, 76-90%, More than 90%
2 2 b          0, 1                                              Yes, No
3 2 c          0, 1                                              Yes, No
4 3 d 1, 2, 3, 4, 5                                        l, m, n, o, p

To factor each column, put the x dataset, the levels, and the labels all into a named list. The names of each element correspond to the names of the arguments you need to use from factor. This allows you to easily use pmap_df from purrr to factor each column of x, using the known levels and labels information.

library(purrr)
pmap_df(list(x = x, levels = key$levels, labels = key$labels), factor)

# A tibble: 5 x 4
              a      b      c      d
         <fctr> <fctr> <fctr> <fctr>
1 Less than 25%    Yes     No      m
2        25-50%     No    Yes      n
3        51-75%    Yes     No      o
4        76-90%     No    Yes      p
5 More than 90%    Yes     No      n

In pmap functions, the elements within the list must be all the same size. In this case, the first element has 4 columns and the second two are vectors with length 4.

Upvotes: 2

Related Questions