user29756984

Reputation: 21

R Function to replace all instances of "f" in a factored variable?

I'm new to R, so I'm sorry if this has been asked before, but I haven't found a solution online. I have a data set of survey responses about gender and sex that were typed in by 350 participants. Many of the responses mean the same thing but are typed or spelled differently. Below are some of the values I get when I run "unique(df$variable)". There is a lot of variation: misspellings, differences in capitalization, etc.

[1] Male                                                 Female                                              
 [3] female                                               Female/woman                                        
 [5] Female                                               F                                                   
 [7] female                                               Woman                                               
 [9] Cis female, she her                                  Female cisgender                                    
[11] Female heterosexual                                  I identify as a trans woman!                        
[13] Demiboy                                              Transwoman                                          
[15] My sex is female and my gender identity is nonbinary male                                                
[17] m                                                    woman                                               
[19] Woman                                                Nonbinary                                           
[21] my gender doesn't exist                              Male/AMAB                                           

What I've done: I have tried classifying all unique values and replacing with mutate:

f <- c("Female/woman", "female", "Female cisgender", "Female", "Woman", "woman", "Women", "women", "f", "F" )
m <- c("male", "Cis Male", "Male", "m", "M", "ma,e=]]")
gq <- c("genderqueer", "nonbinary", "genderfluid")

df |> 
  mutate(GenderNew = case_when(
    GenderSex %in% f ~ "F",
    GenderSex %in% m ~ "M",
    GenderSex %in% gq ~ "Q"
  )) -> df_new

But this gave me many NAs in my GenderNew column. I've also had no success with grepl.

What I'm looking to do: Replace all occurrences of the letter "f" regardless of position in the string/participant response with the letter "F". Same with responses indicating male or genderqueer. I would like my outcome to be either "M", "F", "GQ", or whatever the participant typed as a response so I can re-code it without missing anything.


GenderSex <-
c("Male", "Female", "female", "Female/woman", "Female", "F", 
"female", "Woman", "Cis female, she her", "Female cisgender", 
"Female heterosexual", "I identify as a trans woman!", "Demiboy", 
"Transwoman", "My sex is female and my gender identity is nonbinary", 
"male", "m", "woman", "Woman", "Nonbinary", "my gender doesn't exist", 
"Male/AMAB")

Upvotes: 2

Views: 92

Answers (2)

jay.sf

Reputation: 73272

I like your initial approach better. Your gender variable presumably doesn't have millions of distinct values, so sorting the unique values and dividing them by hand should be a safe option. dput prints a vector definition to the console that you can copy and edit.

> df$gender |> unique() |> sort() |> dput()
c("Cis female, she her", "Demiboy", "F", "female", "Female", 
  "Female cisgender", "Female heterosexual", "Female/woman", "I identify as a trans woman!", 
  "m", "male", "Male", "Male/AMAB", "my gender doesn't exist", 
  "Nonbinary", "Transwoman", "woman", "Woman")

Since we want F, M, and everything else (Q), we only need to define the first two groups.

> f <- c("Cis female, she her", "F", "female", "Female", "Female cisgender", 
+        "Female heterosexual", "Female/woman", "woman", "Woman")
> m <- c("m", "male", "Male", "Male/AMAB")

Then just call replace three times, once per group. It is easy to read and more efficient than ifelse or the like.

> df |> 
+   transform(gender_new=replace(gender, gender %in% f, 'F')) |> 
+   transform(gender_new=replace(gender_new, gender %in% m, 'M')) |> 
+   transform(gender_new=replace(gender_new, !gender %in% c(f, m), 'Q')) 
                         gender           x gender_new
1                         Woman -0.09465904          F
2                  Female/woman  0.63286260          F
3                  Female/woman  0.63286260          F
4                     Male/AMAB -1.78130843          M
5                         woman -2.65645542          F
6                       Demiboy -1.38886070          Q
7                        Female  0.40426832          F
8                  Female/woman  0.63286260          F
9                        Female -0.56469817          F
10                        woman -2.65645542          F
11                       female  0.36312841          F
12                            m -0.28425292          M
13      my gender doesn't exist -0.30663859          Q
14                        woman -2.65645542          F
15                            F -0.10612452          F
16                            F -0.10612452          F
17                       Female -0.56469817          F
18                    Nonbinary  1.32011335          Q
19                       female  0.36312841          F
20                    Male/AMAB -1.78130843          M
21      my gender doesn't exist -0.30663859          Q
22                       Female -0.56469817          F
23                            F -0.10612452          F
24             Female cisgender -0.06271410          F
25                        Woman -0.09465904          F
26                       Female  0.40426832          F
27                         Male  1.37095845          M
28                            m -0.28425292          M
29                       female  1.51152200          F
30                 Female/woman  0.63286260          F
31                      Demiboy -1.38886070          Q
32             Female cisgender -0.06271410          F
33          Cis female, she her  2.01842371          F
34 I identify as a trans woman!  2.28664539          Q
35                    Nonbinary  1.32011335          Q
36          Cis female, she her  2.01842371          F
37          Female heterosexual  1.30486965          F
38                       female  0.36312841          F
39                         male  0.63595040          M
40                       Female  0.40426832          F
41                   Transwoman -0.27878877          Q
42                       Female  0.40426832          F
43                    Male/AMAB -1.78130843          M
44                       Female -0.56469817          F
45                        woman -2.65645542          F
46                            m -0.28425292          M
47                        woman -2.65645542          F
48                       Female  0.40426832          F
49                   Transwoman -0.27878877          Q
50                        Woman -0.09465904          F

Data:

set.seed(42)
df <- data.frame(
  gender=GenderSex, ## from OP
  x=rnorm(length(GenderSex))
)
df <- df[sample.int(nrow(df), 50, replace=TRUE), ] |> `rownames<-`(NULL)
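
Because everything outside f and m ends up as Q, it may be worth listing those leftover values before recoding, so that nothing is lumped together by accident. A minimal sketch in base R, using the OP's GenderSex vector and the f and m vectors from above (the name unclassified is only illustrative):

```r
# Raw responses as posted by the OP.
GenderSex <- c("Male", "Female", "female", "Female/woman", "Female", "F",
               "female", "Woman", "Cis female, she her", "Female cisgender",
               "Female heterosexual", "I identify as a trans woman!", "Demiboy",
               "Transwoman", "My sex is female and my gender identity is nonbinary",
               "male", "m", "woman", "Woman", "Nonbinary",
               "my gender doesn't exist", "Male/AMAB")

# Hand-sorted groups, as in the answer above.
f <- c("Cis female, she her", "F", "female", "Female", "Female cisgender",
       "Female heterosexual", "Female/woman", "woman", "Woman")
m <- c("m", "male", "Male", "Male/AMAB")

# Distinct responses that are in neither group and would therefore become 'Q'.
unclassified <- setdiff(unique(GenderSex), c(f, m))
print(unclassified)
```

Any value printed here that should really be F or M can be added to the corresponding vector before running the three replace calls.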

Upvotes: 1

Rui Barradas

Reputation: 76575

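Use an auxiliary function that runs grepl over a whole vector of patterns and returns one logical value per row: TRUE if any pattern matches that response. Responses that match none of the groups keep their original text through the default value.
Thanks to r2evans for the suggestion of having a default value.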

library(dplyr)

f <- c("Female/woman", "female", "Female cisgender", "Female", 
       "Woman", "woman", "Women", "women", "f", "F" )
m <- c("male", "Cis Male", "Male", "m", "M", "ma,e=]]")
gq <- c("genderqueer", "nonbinary", "genderfluid")

fun <- function(x, pattern) {
  Grepl <- Vectorize(grepl, "pattern")
  out <- Grepl(pattern, x)
  rowSums(out) > 0L
}

df <- data.frame(GenderSex)

df |> 
  mutate(GenderNew = case_when(
    fun(GenderSex, f) ~ "F",
    fun(GenderSex, m) ~ "M",
    fun(GenderSex, gq) ~ "Q",
    .default = GenderSex
  ))
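
One thing to watch with pattern matching: grepl matches anywhere in the string and is case sensitive, so a one-letter pattern such as "m" also matches "Demiboy", while "nonbinary" does not match "Nonbinary". The following is only a sketch of one possible way to tighten this, in base R, by lower-casing the responses and matching whole strings. The helper names normalize and recode_gender, the shortened example vector, and the exact pattern lists are assumptions for illustration, not part of the answer above.

```r
# A few of the OP's responses, shortened for illustration.
GenderSex <- c("Male", "Female", "female", "F", "m", "Demiboy",
               "my gender doesn't exist", "Nonbinary", "Male/AMAB")

# Unanchored one-letter patterns match anywhere in the string.
grepl("m", "Demiboy")                  # TRUE
grepl("m", "my gender doesn't exist")  # TRUE

# Hypothetical helper: lower-case and trim before comparing.
normalize <- function(x) trimws(tolower(x))

# Hypothetical helper: recode only whole-string matches,
# keep every other response exactly as typed.
recode_gender <- function(x) {
  key <- normalize(x)
  out <- x
  out[grepl("^(f|female|woman|women)$", key)] <- "F"
  out[grepl("^(m|male|man|men)$", key)] <- "M"
  out[grepl("^(genderqueer|nonbinary|genderfluid)$", key)] <- "Q"
  out
}

data.frame(GenderSex, GenderNew = recode_gender(GenderSex))
```

With this stricter matching, longer responses such as "Male/AMAB" stay as typed and can then be reviewed and recoded by hand.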

Upvotes: 2
