Reputation: 353
I have a data frame containing a mixture of numeric and categorical variables (factors in R) such as follows:
df <- data.frame(
age = c(18, 29, 55, 90, 44),
sex = c("M", "F", "M", "M", "F"),
category = c("cat1", "cat1", "cat2", "cat2", "cat2"))
I need to run some statistical analysis on this data, but the analysis requires the input to be a numeric matrix. I can of course convert `sex` and `category` into numeric variables by using something like `df$sex <- ifelse(df$sex == "M", 1, 0)`, but this is very tedious to do manually when I have a lot of variables. Furthermore, the analysis will return a numeric matrix that I need to convert back to the format of `df` with the original categories, so I'll need something like `x$sex <- ifelse(x$sex == 1, "M", "F")` (here, `x` holds the return value of the analysis) to convert each variable back to the original data format, essentially undoing the first conversion. For simplicity, please assume all the categorical variables are non-ordinal (i.e. there is no order to the categories) and binary (only two factor levels).
So I think what I want are two functions that can do this automatically, say `fct2num` and `num2originalfct`. `fct2num` will need to return the data frame as a matrix with my variables `fct_var <- c("sex", "category")` converted appropriately to numeric, but also some sort of dictionary `dict` of the mappings (e.g., `sex = "F" : 0, "M" : 1`) that I'll need to pass to `num2originalfct` along with the output of the analysis so it can be recoded back to the original categories. Any ideas or alternatives on how best to accomplish this? Something like this would work for `fct2num` for one-hot encoding, but I'm not sure how I would revert the encoding back to the original factors.
Upvotes: 0
Views: 404
Reputation: 72593
You may strip off the factor labels using `as.integer` and subtract 1.
> chr <- names(df)[sapply(df, is.character)] ## assumed definition: names of the character columns
> df[chr] <- lapply(df[chr], \(x) as.integer(as.factor(x)) - 1L)
> m <- as.matrix(df)
> m
age sex category
[1,] 18 1 0
[2,] 29 0 0
[3,] 55 1 1
[4,] 90 1 1
[5,] 44 0 1
Update
To convert forth and back, you may follow an approach like this.
> df <- type.convert(df, as.is=FALSE) ## convert character to factor
> fac <- names(df)[sapply(df, is.factor)] ## store names of the factor columns
> lev <- lapply(df[fac], attr, 'levels') ## store levels of the factors
> df[fac] <- lapply(df[fac], \(x) as.integer(x) - 1L) ## 0-based integer codes
> m <- as.matrix(df)
> m
age sex category
[1,] 18 1 0
[2,] 29 0 0
[3,] 55 1 1
[4,] 90 1 1
[5,] 44 0 1
> ## do stuff with matrix
> df2 <- as.data.frame(m) ## convert back to data.frame
> df2[fac] <- Map(`levels<-`, lapply(df2[fac] + 1L, factor), lev) ## restore levels
> df2
age sex category
1 18 M cat1
2 29 F cat1
3 55 M cat2
4 90 M cat2
5 44 F cat2
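The same idea can be wrapped into the two functions the question asks for. This is only a minimal sketch: `fct2num` and `num2originalfct` are the hypothetical names from the question, not functions from any package, and the sketch assumes the analysis returns exact 0/1 codes in the categorical columns. Indexing into the stored levels (instead of assigning `levels<-`) also covers the case where one of the two levels does not occur in the returned matrix.

```r
# Hypothetical helpers named after the question; not part of any package.
fct2num <- function(df, fct_var) {
  # Record the levels of each categorical column (the "dictionary").
  dict <- lapply(df[fct_var], function(x) levels(factor(x)))
  # Recode each column as 0-based integer codes, in level order.
  df[fct_var] <- Map(function(x, l) as.integer(factor(x, levels = l)) - 1L,
                     df[fct_var], dict)
  list(matrix = as.matrix(df), dict = dict)
}

num2originalfct <- function(m, dict) {
  out <- as.data.frame(m)
  # Look each code up in the stored levels (assumes exact 0/1 codes).
  out[names(dict)] <- Map(function(code, l) factor(l[code + 1L], levels = l),
                          out[names(dict)], dict)
  out
}

df <- data.frame(
  age = c(18, 29, 55, 90, 44),
  sex = c("M", "F", "M", "M", "F"),
  category = c("cat1", "cat1", "cat2", "cat2", "cat2")
)

enc <- fct2num(df, c("sex", "category"))
enc$matrix   # numeric matrix to pass to the analysis
enc$dict     # levels per variable; position 1 maps to 0, position 2 maps to 1
restored <- num2originalfct(enc$matrix, enc$dict)
restored
```

If the analysis can return non-integer values in those columns, they would need to be rounded or thresholded before the lookup; how to do that depends on the analysis.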
Data:
> dput(df)
structure(list(age = c(18, 29, 55, 90, 44), sex = c(1, 0, 1,
1, 0), category = c(1, 1, 2, 2, 2)), row.names = c(NA, -5L), class = "data.frame")
Upvotes: 1
Reputation: 79188
We can use a 1-2 coding:
data.matrix(df)
age sex category
[1,] 18 2 1
[2,] 29 1 1
[3,] 55 2 2
[4,] 90 2 2
[5,] 44 1 2
If you need to transform to a 0-1 coding, then subtract 1 from the non-numeric columns:
t(t(data.matrix(df)) - !sapply(df, is.numeric))
age sex category
[1,] 18 1 0
[2,] 29 0 0
[3,] 55 1 1
[4,] 90 1 1
[5,] 44 0 1
If all the non-numeric columns have two levels, use `model.matrix`:
model.matrix(~., df)[, -1]
age sexM categorycat2
1 18 1 0
2 29 0 0
3 55 1 1
4 90 1 1
5 44 0 1
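To get from the `model.matrix` output back to the original categories, one option is to save the factor levels before encoding and then look each dummy column up again. The following is a sketch under stated assumptions: every factor is binary, R's default treatment contrasts are in effect (so the dummy column is named after the variable plus its second level, e.g. `sexM`), and the names `lev`, `mm`, `num_cols` and `restored` are only illustrative.

```r
df <- data.frame(
  age = c(18, 29, 55, 90, 44),
  sex = c("M", "F", "M", "M", "F"),
  category = c("cat1", "cat1", "cat2", "cat2", "cat2"),
  stringsAsFactors = TRUE
)

# Keep the levels of every factor column before encoding.
lev <- lapply(Filter(is.factor, df), levels)

mm <- model.matrix(~ ., df)[, -1]   # drop the intercept column

# Start from the numeric columns, then rebuild each binary factor.
num_cols <- setdiff(names(df), names(lev))
restored <- as.data.frame(mm[, num_cols, drop = FALSE])
for (v in names(lev)) {
  dummy <- paste0(v, lev[[v]][2])   # e.g. "sexM": variable name + second level
  restored[[v]] <- factor(lev[[v]][mm[, dummy] + 1], levels = lev[[v]])
}
restored <- restored[names(df)]     # back to the original column order
restored
```

For factors with more than two levels there would be several dummy columns per variable, so the lookup would have to be extended accordingly.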
Update:
To toggle back and forth (using dataframes) you could use the following:
tonumeric <- function(x){
  isfactor <- !sapply(x, is.numeric)                    # non-numeric columns
  codes <- lapply(x[isfactor], \(x) levels(factor(x)))  # remember the levels
  x[isfactor] <- data.matrix(x[isfactor]) - 1           # 0-based codes
  structure(x, codes = codes)                           # keep levels as an attribute
}
tocategory <- function(x){
  codes <- attr(x, 'codes')
  # look each 0-based code up in the stored levels
  x[names(codes)] <- Map(\(i, j) j[i], x[names(codes)] + 1, codes)
  x
}
d_numeric <- tonumeric(df)
d_numeric
age sex category
1 18 1 0
2 29 0 0
3 55 1 1
4 90 1 1
5 44 0 1
tocategory(d_numeric)
age sex category
1 18 M cat1
2 29 F cat1
3 55 M cat2
4 90 M cat2
5 44 F cat2
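Since the analysis in the question works on a matrix, note that `as.matrix` does not carry the `codes` attribute over, so it has to be saved and re-attached. A small sketch of that round trip follows; the two functions are repeated so the block runs on its own, and the names `codes`, `m`, `out` and `result` are only illustrative.

```r
# tonumeric() and tocategory() repeated from above so the sketch is self-contained.
tonumeric <- function(x){
  isfactor <- !sapply(x, is.numeric)
  codes <- lapply(x[isfactor], \(x) levels(factor(x)))
  x[isfactor] <- data.matrix(x[isfactor]) - 1
  structure(x, codes = codes)
}
tocategory <- function(x){
  codes <- attr(x, 'codes')
  x[names(codes)] <- Map(\(i, j) j[i], x[names(codes)] + 1, codes)
  x
}

df <- data.frame(
  age = c(18, 29, 55, 90, 44),
  sex = c("M", "F", "M", "M", "F"),
  category = c("cat1", "cat1", "cat2", "cat2", "cat2")
)

d_numeric <- tonumeric(df)
codes <- attr(d_numeric, "codes")   # save the dictionary before converting
m <- as.matrix(d_numeric)           # plain numeric matrix for the analysis

## ... run the analysis on m; here m is passed through unchanged ...

out <- as.data.frame(m)
attr(out, "codes") <- codes         # re-attach the dictionary
result <- tocategory(out)
result
```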
Upvotes: 0