Reputation: 381
I have a data set that has 900 columns of numeric data and I need to convert the numeric columns to factors that have labels. Many labels will repeat. I am trying to write a function that will take the numeric columns, identify the type of label that the column needs, and then apply that label.
Here is an example data frame:
#create data frame with columns a,b,c,d
a<-c(1,2,3,4,5)
b<-c(0,1,0,1,0)
c<-c(1,0,1,0,1)
d<-c(2,3,4,5,3)
x<-as.data.frame(cbind(a,b,c,d))
I have a separate dataframe (i.e. y) that includes a key (i.e. column e) that identifies which factor labels should be applied to which of the columns (i.e. column f). Notice that b and c should have the same label.
e<-c(1,2,2,3)
f<-c("a","b","c","d")
y<-as.data.frame(cbind(e,f))
I would like to write a function that does the following, but automated. Here are the example labels that I would like to apply to a,b,c,d--where a and d are different, but b and c are the same.
x$a<-factor(x$a,
levels=c(1,2,3,4,5),
labels=c("Less than 25%",
"25-50%",
"51-75%",
"76-90%",
"More than 90%"))
x$b<-factor(x$b,
levels=c(0,1),
labels=c("Yes","No"))
x$c<-factor(x$c,
levels=c(0,1),
labels=c("Yes","No"))
x$d<-factor(x$c,
levels=c(1,2,3,4,5),
labels=c("l","m","n","o","p"))
With the final data set looking like:
>x
a b c d
1 Less than 25% Yes No m
2 25-50% No Yes n
3 51-75% Yes No o
4 76-90% No Yes p
5 More than 90% Yes No n
In the actual data set, there will be close to 60 labels.
Upvotes: 2
Views: 3020
Reputation: 36076
If you can get the labels and levels appropriately associated with the e
column, which is your link to the columns of the dataset, you can do this via purrr:pmap_df
.
Here's how that would look. Most of the work is in getting the labels and levels as a list column, which I do via tibble
(loaded with dplyr).
Starting with your second dataset, y
, which is an important part of this.
e = c(1,2,2,3)
f = names(x)
y = data.frame(e,f)
e f
1 1 a
2 2 b
3 2 c
4 3 d
Make sure the levels and labels are available and can be associated with your e
vector. If they are in a long format, you could get them into a list-column format via tidyr::nest
. I found this to be the most time-consuming step in terms of getting this info written out.
library(dplyr)
levels.labels = tibble(e = c(1, 2, 3),
levels = list(1:5, 0:1, 1:5),
labels = list(c("Less than 25%",
"25-50%",
"51-75%",
"76-90%",
"More than 90%"),
c("Yes","No"),
c("l","m","n","o","p")))
If you needed to write your levels and labels out within R you might want to try tribble
, which is available in the development version of the tibble package.
library(tibble)
levels.labels = tribble(~e, ~levels, ~labels,
1, 1:5, c("Less than 25%",
"25-50%",
"51-75%",
"76-90%",
"More than 90%"),
2, 0:1, c("Yes","No"),
3, 1:5, c("l","m","n","o","p"))
Merge the levels and labels with your y
dataset based on e
. The rows of the result is a 1 to 1 match of the columns of x
.
key = left_join(y, levels.labels)
e f levels labels
1 1 a 1, 2, 3, 4, 5 Less than 25%, 25-50%, 51-75%, 76-90%, More than 90%
2 2 b 0, 1 Yes, No
3 2 c 0, 1 Yes, No
4 3 d 1, 2, 3, 4, 5 l, m, n, o, p
To factor each column, put the x dataset, the levels, and the labels all into a named list. The names of each element correspond to the names of the arguments you need to use from factor
. This allows you to easily use pmap_df
from purrr to factor
each column of x
, using the known levels and labels information.
library(purrr)
pmap_df(list(x = x, levels = key$levels, labels = key$labels), factor)
# A tibble: 5 x 4
a b c d
<fctr> <fctr> <fctr> <fctr>
1 Less than 25% Yes No m
2 25-50% No Yes n
3 51-75% Yes No o
4 76-90% No Yes p
5 More than 90% Yes No n
In pmap
functions, the elements within the list must be all the same size. In this case, the first element has 4 columns and the second two are vectors with length 4.
Upvotes: 2