elbord77

Reputation: 353

R functions to convert factors to numeric and then back to original factor levels

I have a data frame containing a mixture of numeric and categorical variables (factors in R), such as the following:

df <- data.frame(
                 age = c(18, 29, 55, 90, 44),
                 sex = c("M", "F", "M", "M", "F"),
                 category = c("cat1", "cat1", "cat2", "cat2", "cat2"))

I need to run some statistical analysis on this data but this analysis requires the input data to be in the form of a numeric matrix. I can of course convert sex and category into numeric variables by using something like df$sex <- ifelse(df$sex == "M", 1, 0) but this is very tedious to do manually when I have a lot of variables. Furthermore, the analysis will return a numeric matrix that I need to reconvert to the format of df with the original categories, and so I'll need to use something like x$sex <- ifelse(x$sex == 1, "M", "F") (here, x holds the return value of the analysis) to reconvert each variable back to the original data format, essentially undoing the first conversion. For simplicity please assume all the categorical variables are non-ordinal (i.e. there is no order to the categories) and binary (only two factor levels).

So I think I need two functions that do this automatically, say fct2num and num2originalfct. fct2num would return the data frame as a matrix with my variables fct_var <- c("sex", "category") converted to numeric, along with some sort of dictionary dict of the mappings (e.g., sex = "F" : 0, "M" : 1). I would then pass dict to num2originalfct, together with the output of the analysis, to recode it back to the original categories. Any ideas or alternatives on how best to accomplish this? Something like this would work for fct2num for one-hot encoding, but I'm not sure how I would revert the encoding back to the original factors.

Upvotes: 0

Views: 404

Answers (2)

jay.sf

Reputation: 72593

You may strip off the factor labels using as.integer and subtract 1.

> chr <- names(df)[sapply(df, is.character)]  ## names of the character columns
> df[chr] <- lapply(df[chr], \(x) as.integer(as.factor(x)) - 1L)
> m <- as.matrix(df)
> m
     age sex category
[1,]  18   1        0
[2,]  29   0        0
[3,]  55   1        1
[4,]  90   1        1
[5,]  44   0        1

Update

To convert back and forth, you may follow an approach like this.

> df <- type.convert(df, as.is=FALSE)  ## convert character to factor
> fac <- names(df)[sapply(df, is.factor)]  ## store names of which are factors
> lev <- lapply(df[fac], attr, 'levels')  ## store levels of the factors
> df[fac] <- lapply(df[fac], \(x) as.integer(x) - 1L)  ## zero-based integer codes
> m <- as.matrix(df)
> m
     age sex category
[1,]  18   1        0
[2,]  29   0        0
[3,]  55   1        1
[4,]  90   1        1
[5,]  44   0        1
> ## do stuff with matrix
> df2 <- as.data.frame(m)  ## convert back to data.frame
> df2[fac] <- Map(`levels<-`, lapply(df2[fac] + 1L, factor), lev) ## restore levels
> df2
  age sex category
1  18   M     cat1
2  29   F     cat1
3  55   M     cat2
4  90   M     cat2
5  44   F     cat2

Data:

> dput(df)
structure(list(age = c(18, 29, 55, 90, 44), sex = c("M", "F", "M", 
"M", "F"), category = c("cat1", "cat1", "cat2", "cat2", "cat2")), row.names = c(NA, -5L), class = "data.frame")
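The steps above can be wrapped into the two functions the question asks for. This is only a minimal sketch: the names fct2num and num2originalfct are taken from the question and are not part of any package, and the "analysis" is assumed to hand back a matrix with the same column names. The codes are decoded by indexing into the stored levels rather than by re-running factor() on the returned values, on the assumption that a returned column may not contain every code.

```r
# Hypothetical helpers named after the question; not from any package.
fct2num <- function(data, fct_var) {
  # Record the levels of each categorical column before the labels are dropped.
  dict <- lapply(data[fct_var], function(x) levels(factor(x)))
  # Replace each categorical column by its zero-based integer code.
  data[fct_var] <- Map(function(x, lv) as.integer(factor(x, levels = lv)) - 1L,
                       data[fct_var], dict)
  list(matrix = as.matrix(data), dict = dict)
}

num2originalfct <- function(m, dict) {
  out <- as.data.frame(m)
  # Look each code up in the stored levels, so a column that happens to
  # contain only one code is still labelled correctly.
  out[names(dict)] <- Map(function(x, lv) factor(lv[x + 1], levels = lv),
                          out[names(dict)], dict)
  out
}

df <- data.frame(age = c(18, 29, 55, 90, 44),
                 sex = c("M", "F", "M", "M", "F"),
                 category = c("cat1", "cat1", "cat2", "cat2", "cat2"))

enc <- fct2num(df, fct_var = c("sex", "category"))
## enc$matrix goes to the analysis; here it is passed through unchanged
res <- num2originalfct(enc$matrix, enc$dict)
```

Here enc$dict plays the role of the dictionary: a named list with one character vector of levels per categorical column, where position minus one is the numeric code.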

Upvotes: 1

Onyambu

Reputation: 79188

We can use a 1-2 coding:

data.matrix(df)

     age sex category
[1,]  18   2        1
[2,]  29   1        1
[3,]  55   2        2
[4,]  90   2        2
[5,]  44   1        2

If you need to transform to a 0-1 coding, then subtract 1 from the non-numeric columns:

t(t(data.matrix(df)) - !sapply(df, is.numeric))
     age sex category
[1,]  18   1        0
[2,]  29   0        0
[3,]  55   1        1
[4,]  90   1        1
[5,]  44   0        1

If all the non-numeric columns have two levels, use `model.matrix`:

model.matrix(~., df)[,-1]
  age sexM categorycat2
1  18    1            0
2  29    0            0
3  55    1            1
4  90    1            1
5  44    0            1
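To go back from the model.matrix coding, one option is to store the levels beforehand and rebuild the factors from the dummy columns. The sketch below assumes the default treatment contrasts and two-level variables only, so each variable maps to exactly one dummy column; the helper objects lev, dummy and back are illustrative names.

```r
df <- data.frame(age = c(18, 29, 55, 90, 44),
                 sex = c("M", "F", "M", "M", "F"),
                 category = c("cat1", "cat1", "cat2", "cat2", "cat2"))
fct_var <- c("sex", "category")

# Store the levels before encoding; this is the "dictionary".
lev <- lapply(df[fct_var], function(x) levels(factor(x)))

mm <- model.matrix(~ ., df)[, -1]

# For a two-level factor the dummy column is named <variable><second level>,
# e.g. "sexM" and "categorycat2".
dummy <- setNames(paste0(fct_var, sapply(lev, `[`, 2)), fct_var)

# Keep the columns that are not dummies, then decode each dummy column:
# 0 is the first stored level, 1 is the second.
back <- as.data.frame(mm[, setdiff(colnames(mm), dummy), drop = FALSE])
back[fct_var] <- Map(function(col, lv) factor(lv[mm[, col] + 1], levels = lv),
                     dummy, lev)
```

With more than two levels per variable there would be several dummy columns per variable, and this one-column lookup would no longer be enough.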

Update:

To toggle back and forth (using dataframes) you could use the following:

tonumeric <- function(x){
    isfactor <- !sapply(x, is.numeric)
    codes <- lapply(x[isfactor], \(x)levels(factor(x)))
    x[isfactor] <- data.matrix(x[isfactor]) - 1
    structure(x, codes = codes)
}

tocategory <- function(x){
    codes <- attr(x, 'codes')
    x[names(codes)] <- Map(\(i,j)j[i], x[names(codes)]+1, codes)
    x
}

d_numeric <- tonumeric(df)
d_numeric
  age sex category
1  18   1        0
2  29   0        0
3  55   1        1
4  90   1        1
5  44   0        1

tocategory(d_numeric)
  age sex category
1  18   M     cat1
2  29   F     cat1
3  55   M     cat2
4  90   M     cat2
5  44   F     cat2
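Note that the codes attribute lives on the data frame, and as.matrix() does not carry it over. If the analysis needs a plain matrix, a possible workaround is to hold on to the codes separately and reattach them before decoding. The functions are repeated here only so the snippet runs on its own, and the analysis step is assumed to return a matrix with the same column names.

```r
tonumeric <- function(x){
    isfactor <- !sapply(x, is.numeric)
    codes <- lapply(x[isfactor], \(x)levels(factor(x)))
    x[isfactor] <- data.matrix(x[isfactor]) - 1
    structure(x, codes = codes)
}

tocategory <- function(x){
    codes <- attr(x, 'codes')
    x[names(codes)] <- Map(\(i,j)j[i], x[names(codes)]+1, codes)
    x
}

df <- data.frame(age = c(18, 29, 55, 90, 44),
                 sex = c("M", "F", "M", "M", "F"),
                 category = c("cat1", "cat1", "cat2", "cat2", "cat2"))

d_numeric <- tonumeric(df)
codes <- attr(d_numeric, "codes")  # keep the dictionary separately
m <- as.matrix(d_numeric)          # plain numeric matrix; attribute is gone

## the analysis on m would go here; this sketch passes m through unchanged

out <- as.data.frame(m)
attr(out, "codes") <- codes        # reattach before decoding
res <- tocategory(out)
```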

Upvotes: 0
