Reputation: 2263
I have 150k columns of 105 million entries, each of which is either "none", "01", "12" or "2+". Unfortunately, not all columns contain all of the levels.
e.g.
library(magrittr)  # provides %>%
df <- data.frame(x1 = rep(c("none", "12", "2+"), each = 5),
                 x2 = rep(c("none", "01", "12"), each = 5)) %>%
  data.table::as.data.table()
so if I do
df$x1<-as.integer(as.factor(df$x1))
I get the same as
df$x2<-as.integer(as.factor(df$x2))
which isn't what I'm after.
So I could do:
require(magrittr)
df$x1<-factor(df$x1,levels = c("none","01","12","2+")) %>% as.integer()
df$x2<-factor(df$x2,levels = c("none","01","12","2+")) %>% as.integer()
And that does the job but I have 150K columns. What is the best way to deal with them as I can't do the above one by one?
Upvotes: 2
Views: 47
Reputation: 76621
Here is a data.table solution.
With a large data set, instead of calling names(df) twice, it might be a good idea to call it just once, assigning the value prior to transforming the df's columns, and then using that vector of 150K names.
library(data.table)
levs <- c("none","01","12","2+")
df[, (names(df)) := lapply(.SD, factor, levels = levs), .SDcols = names(df)]
identical(levels(df$x1), levels(df$x2))
#[1] TRUE
So now wrap the same factor() call in as.integer() to coerce the levels to integer codes.
df[, (names(df)) := lapply(.SD, function(x) {
  as.integer(factor(x, levels = levs))
}), .SDcols = names(df)]
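The suggestion above about calling names(df) only once could look roughly like the sketch below. The variable name `cols` is just an illustrative choice, and the toy table is the one from the question rather than the real 150K-column data. `copy()` is used so the stored names are not tied to the table by reference.

```r
library(data.table)

# Toy data from the question; the real table would have ~150K columns.
df <- data.table(
  x1 = rep(c("none", "12", "2+"), each = 5),
  x2 = rep(c("none", "01", "12"), each = 5)
)

levs <- c("none", "01", "12", "2+")

# Compute the column names once and reuse them (`cols` is a hypothetical name).
cols <- copy(names(df))

# Convert every column in place to integer codes based on the shared levels.
df[, (cols) := lapply(.SD, function(x) as.integer(factor(x, levels = levs))),
   .SDcols = cols]
```

With the shared levels, "none" maps to 1 in both columns, and "12" maps to 3 whether or not "01" appears in that column.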
Upvotes: 2
Reputation: 887691
If we want to apply this to multiple columns, use across
library(dplyr)
df1 <- df %>%
  mutate(across(everything(), ~
    as.integer(factor(., levels = c("none","01","12","2+")))))
If we want to skip the first column, specify its index with -
df1 <- df %>%
  mutate(across(-1, ~
    as.integer(factor(., levels = c("none","01","12","2+")))))
Or use base R
df[] <- lapply(df, function(x) as.integer(factor(x, levels = c("none","01","12","2+"))))
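As a side note, and not something the answers above rely on: since only the integer codes are wanted, base R's match() against the level vector should give the same codes without building a factor first. A minimal sketch, with made-up sample values in `x`, that checks the two approaches agree:

```r
levs <- c("none", "01", "12", "2+")

# Hypothetical sample values, including a missing entry.
x <- c("none", "12", "2+", "01", NA)

# Codes via an explicit factor with fixed levels, as in the answers above.
via_factor <- as.integer(factor(x, levels = levs))

# Codes via a direct lookup of each value's position in `levs`.
via_match <- match(x, levs)

# Compare the two results.
same <- identical(via_factor, via_match)
```

Whether match() is actually faster at 105 million rows is something to benchmark on the real data rather than assume.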
Upvotes: 2