EricAtani
EricAtani

Reputation: 65

How to replace all values in a column based on an ordered vector in r

I am trying to replace all numeric values in a column of a data frame with ordered categories. Here's a dummy data frame:

df <- data.frame(a = c(1:100), b = sample(c(0,20), size = 100, replace = TRUE), c = c(1:100))

Note that the actual data frame in question is a dta file imported with haven::read_dta(). The actual data frame can be found on GSS here. I am working on the file of 2018, and want to replace all values in b, i.e. 0 to 20, with a group of categories as this:

educ_vec <- c("No formal schooling", "1st grade", "2nd grade", "3rd grade", "4th grade", "5th grade", "6th grade", "7th grade", "8th grade", "9th grade", "10th grade", "11th grade", "12th grade", "1 year of college", "2 years of college", "3 years of college", "4 years of college", "5 years of college", "6 years of college", "7 years of college", "8 years of college")
educ_fac <- factor(educ_vec, ordered = TRUE, levels = educ_vec)

If I use mutate and ifelse for every single category, the process is too long, and it does not retain the order in educ_fac. I tried a few ways to do this in one step and did not succeed. One way was this:

gss_df %>% 
  mutate(educ = fct_recode(educ, 
                           "No formal schooling" = 0, 
                           "1st grade" = 1, 
                           "2nd grade" = 2, 
                           "3rd grade" = 3, 
                           "4th grade" = 4, 
                           "5th grade" = 5, 
                           "6th grade" = 6, 
                           "7th grade" = 7, 
                           "8th grade" = 8, 
                           "9th grade" = 9, 
                           "10th grade" = 10, 
                           "11th grade" = 11, 
                           "12th grade" = 12, 
                           "1 year of college" = 13, 
                           "2 years of college" = 14, 
                           "3 years of college" = 15, 
                           "4 years of college" = 16, 
                           "5 years of college" = 17, 
                           "6 years of college" = 18, 
                           "7 years of college" = 19, 
                           "8 years of college" = 20))

Error: `f` must be a factor (or character vector or numeric vector).

The other two ways were similar but no success either:

gss_df %>% 
  mutate(educ = fct_recode(educ, educ_fac))

Error: `f` must be a factor (or character vector or numeric vector).
gss_df %>% 
  mutate(educ = recode_factor(educ, educ_vec, ordered = TRUE))

Error in UseMethod("recode") : no applicable method for 'recode' applied to an object of class "haven_labelled"

Could anyone please give a solution to this?

Upvotes: 3

Views: 586

Answers (2)

StupidWolf
StupidWolf

Reputation: 46908

I cannot read in the dta file for some reasons, so below I simulate data to show you my suggestion. You start with your educ_vec vector.

educ_vec <- c("No formal schooling", "1st grade", 
"2nd grade", "3rd grade", "4th grade", "5th grade", 
"6th grade", "7th grade", "8th grade", "9th grade", 
"10th grade", "11th grade", "12th grade", "1 year of college", 
"2 years of college", "3 years of college", "4 years of college", 
"5 years of college", "6 years of college", "7 years of college", 
"8 years of college")

If you look at the educ_vec , it is already in the format you want

# this is meant for 0
educ_vec[1]
[1] "No formal schooling"
# this is meant for 20
educ_vec[21]
[1] "8 years of college"

If your score is i, the new categorical value will be educ_vec[i+1]; so we can make use of this below:

set.seed(100)
gss_df <- data.frame(educ=sample(0:20,30,replace=TRUE))
gss_df %>% 
mutate(new=factor(educ_vec[educ+1],ordered = TRUE, levels = educ_vec))

   educ                new
1     9          9th grade
2     5          5th grade
3    15 3 years of college
4    18 6 years of college
5    13  1 year of college
6    11         11th grade
7     5          5th grade
8     3          3rd grade
9     5          5th grade
10    1          1st grade
11    6          6th grade
12    6          6th grade
13   10         10th grade
14   17 5 years of college
15   11         11th grade
16    2          2nd grade
17   18 6 years of college
18    7          7th grade
19   17 5 years of college
20    1          1st grade
21   18 6 years of college
22    3          3rd grade
23    3          3rd grade
24   19 7 years of college
25   15 3 years of college
26   20 8 years of college
27    6          6th grade
28   15 3 years of college
29   10         10th grade
30   19 7 years of college

And yes it works if some of the factors are not found in the data:

gss_df <- data.frame(educ=0:5)%>%
mutate(new=factor(educ_vec[educ+1],ordered = TRUE, levels = educ_vec))

  educ                 new
1    0 No formal schooling
2    1           1st grade
3    2           2nd grade
4    3           3rd grade
5    4           4th grade
6    5           5th grade

You can see the new column is a factor with the intended categories.

str(gss_df)
'data.frame':   6 obs. of  2 variables:
 $ educ: int  0 1 2 3 4 5
 $ new : Ord.factor w/ 21 levels "No formal schooling"<..: 1 2 3 4 5 6

If you have scores that are not in 0-20, for example -1, -2 or 21,22 etc.. then I suggest doing the following:

names(educ_vec) = 0:20
gss_df <- data.frame(educ=c(-1,0,20,21))
# you can also use mutate
gss_df$new <- educ_vec[match(gss_df$educ,names(educ_vec))]
gss_df

  educ                 new
1   -1                <NA>
2    0 No formal schooling
3   20  8 years of college
4   21                <NA>

Match will return a NA if it cannot find the corresponding name in your educ_vec

Upvotes: 1

deepseefan
deepseefan

Reputation: 3791

Another way to approach the problem will be to use a named vector and do factor ordering later. There are a coupe of ways to skin this problem once you read the .dta file to your workspace.

set.seed(777)
library(tidyverse)
df <- data.frame(a = c(1:100), b = sample(c(0:20), size = 100, replace = TRUE), c = c(1:100))

# -------------------------------------------------------------------------
head(df)
#   a  b c
# 1 1  0 1
# 2 2 18 2
# 3 3 11 3
# 4 4  9 4
# 5 5 11 5
# 6 6  8 6

# -------------------------------------------------------------------------

# this will be used as name istead
educ_vec <- c("No formal schooling", "1st grade", "2nd grade", "3rd grade", "4th grade", "5th grade", "6th grade", "7th grade", "8th grade", "9th grade", "10th grade", "11th grade", "12th grade", "1 year of college", "2 years of college", "3 years of college", "4 years of college", "5 years of college", "6 years of college", "7 years of college", "8 years of college")

# alues as char from 0 to 20
value_vec <- as.character(seq(21)-1)

# assign educ_vec as names 
names(value_vec) <- educ_vec

# fct_recode b
df$educ <- fct_recode(factor(df$b), !!!value_vec)

# set educ as ordered factor using educ_vec as levels
df$educ <- factor(df$educ, ordered = TRUE, levels = educ_vec)

# -------------------------------------------------------------------------
head(df)
#   a  b c                educ
# 1 1  0 1 No formal schooling
# 2 2 18 2  6 years of college
# 3 3 11 3          11th grade
# 4 4  9 4           9th grade
# 5 5 11 5          11th grade
# 6 6  8 6           8th grade

# -------------------------------------------------------------------------


Upvotes: 1

Related Questions