elikesprogramming

Reputation: 2588

How to preserve original values in a variable turned into a factor?

Here's some working code to illustrate my question:

# Categorical variable recorded as numeric (integer)
df1 <- data.frame(group = c(1, 2, 3, 9, 3, 2, 9, 1, 9, 3, 2))

I have a categorical variable (group) recorded as integer values. For plots, and to include this variable in models, it would be useful to have it encoded as a factor, mapping each number to a label describing the category. So I create a factor:

# Make it a factor
df1$group_f <- factor(x = df1$group, 
                      levels = c(1, 2, 3, 9), 
                      labels = c("G1", "G2", "G3", "Unknown"))

df1
   group group_f
1      1      G1
2      2      G2
3      3      G3
4      9 Unknown
5      3      G3
6      2      G2
7      9 Unknown
8      1      G1
9      9 Unknown
10     3      G3
11     2      G2

Now, the problem is that eventually I need the original values again, because I have to join tables based on this variable, and the other table has the original numbers for each category (1, 2, 3, 9) and not the labels.

Converting to numeric does not work (the "Unknown" category gets mapped to 4 instead of 9):

# And back to numeric
df1$group_num <- as.numeric(df1$group_f)

df1

   group group_f group_num
1      1      G1         1
2      2      G2         2
3      3      G3         3
4      9 Unknown         4
5      3      G3         3
6      2      G2         2
7      9 Unknown         4
8      1      G1         1
9      9 Unknown         4
10     3      G3         3
11     2      G2         2

?factor says:

as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

But applying as.numeric to the levels does not work either, because the levels are now the character labels, which cannot be coerced to numeric:

> as.numeric(levels(df1$group_f))
[1] NA NA NA NA
Warning message:
NAs introduced by coercion 
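
For contrast, here is a minimal sketch of the case where the idiom quoted from ?factor does recover the values: a factor built without custom labels, so its levels are the strings "1", "2", "3", "9". The variable names (f, codes, values) are only illustrative.

```r
# Factor built WITHOUT custom labels: the levels are "1", "2", "3", "9"
f <- factor(c(1, 2, 3, 9, 3, 2, 9, 1, 9, 3, 2))

# as.numeric() on the factor returns the internal integer codes, not the values
codes <- as.numeric(f)

# Converting the levels first, then indexing by the factor, recovers the values
values <- as.numeric(levels(f))[f]
```

Once labels replace those numeric strings, as.numeric(levels(f)) has nothing numeric left to parse, which is why the NAs appear above.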

Is there a way to create a factor variable so that it preserves the original values (1, 2, 3, 9 in this example)?

Note: the idea is to have a single factor variable that carries both the labels describing the categories and the original numbers underneath. Although in this example I keep the variable group alongside the newly created factor variable, in my real use case I cannot do that (it is a huge dataset).

Upvotes: 4

Views: 2082

Answers (1)

Chris

Reputation: 820

If you keep the levels and labels vectors used to create the factor, you can use those to work backwards from the factor label to get back to the value.

group_levels <- c(1, 2, 3, 9)
group_labels <- c("G1", "G2", "G3", "Unknown")
df1$reconstituted_group_num <- group_levels[as.numeric(df1$group_f)]

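This works because as.numeric() on a factor returns its internal integer codes, i.e. each element's position in the factor's levels (here, the labels). The labels and levels vectors were supplied in matching order, so the same position also indexes the original value: Unknown has index 4, and so does its level 9.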
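
A self-contained sketch of the round trip, assuming the two lookup vectors are kept around. The helper names to_group_factor and from_group_factor are hypothetical, not part of base R. One caveat worth illustrating: the integer codes shift if levels are dropped or reordered, so looking up by label with match() is the safer variant in that case.

```r
group_levels <- c(1, 2, 3, 9)
group_labels <- c("G1", "G2", "G3", "Unknown")

# Hypothetical helpers (names are illustrative only)
to_group_factor <- function(x) {
  factor(x, levels = group_levels, labels = group_labels)
}

from_group_factor <- function(f) {
  # Guard: positional lookup is only valid if the labels are unchanged
  stopifnot(identical(levels(f), group_labels))
  group_levels[as.integer(f)]
}

x <- c(1, 2, 3, 9, 3, 2, 9, 1, 9, 3, 2)
f <- to_group_factor(x)
restored <- from_group_factor(f)

# After droplevels() the codes no longer line up with group_levels,
# so look the labels up by name instead of by position
sub <- droplevels(f[f != "G2"])
restored_sub <- group_levels[match(as.character(sub), group_labels)]
```

Here restored should equal x, and restored_sub should equal the values of x with the 2s removed. The match() variant converts the factor to character first, which costs more on a very large column than the positional lookup.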

Upvotes: 1
