Reputation: 21
I'm new to R, so I'm sorry if this has been asked, but I haven't found a solution online. I have a data set of survey responses about gender and sex that were typed in by 350 participants. Many of the responses mean the same thing but are typed or spelled differently. Below are some of the values I get when I run unique(df$variable). There is a lot of variation: misspellings, differences in capitalization, etc.
 [1] Male                                                 Female
 [3] female                                               Female/woman
 [5] Female                                               F
 [7] female                                               Woman
 [9] Cis female, she her                                  Female cisgender
[11] Female heterosexual                                  I identify as a trans woman!
[13] Demiboy                                              Transwoman
[15] My sex is female and my gender identity is nonbinary male
[17] m                                                    woman
[19] Woman                                                Nonbinary
[21] my gender doesn't exist                              Male/AMAB
What I've done: I have tried classifying all unique values and replacing with mutate:
f <- c("Female/woman", "female", "Female cisgender", "Female", "Woman", "woman", "Women", "women", "f", "F" )
m <- c("male", "Cis Male", "Male", "m", "M", "ma,e=]]")
gq <- c("genderqueer", "nonbinary", "genderfluid")
df |>
  mutate(GenderNew = case_when(
    GenderSex %in% f ~ "F",
    GenderSex %in% m ~ "M",
    GenderSex %in% gq ~ "Q",
  )) -> df_new
But this gave me multiple NAs in my GenderNew column. I've also not had success with grepl.
What I'm looking to do: Replace all occurrences of the letter "f" regardless of position in the string/participant response with the letter "F". Same with responses indicating male or genderqueer. I would like my outcome to be either "M", "F", "GQ", or whatever the participant typed as a response so I can re-code it without missing anything.
GenderSex <-
c("Male", "Female", "female", "Female/woman", "Female", "F",
"female", "Woman", "Cis female, she her", "Female cisgender",
"Female heterosexual", "I identify as a trans woman!", "Demiboy",
"Transwoman", "My sex is female and my gender identity is nonbinary",
"male", "m", "woman", "Woman", "Nonbinary", "my gender doesn't exist",
"Male/AMAB")
Upvotes: 2
Views: 92
Reputation: 73272
I like your initial approach better. I don't suppose your gender variable has millions of expressions, so dividing the unique() and sort()ed values by hand should be a safe option. dput() prints the vector to the console in a form you can copy.
> df$gender |> unique() |> sort() |> dput()
c("Cis female, she her", "Demiboy", "F", "female", "Female",
"Female cisgender", "Female heterosexual", "Female/woman", "I identify as a trans woman!",
"m", "male", "Male", "Male/AMAB", "my gender doesn't exist",
"Nonbinary", "Transwoman", "woman", "Woman")
Since we want F, M, and others (Q), we only need to define the first two groups by hand; everything that is in neither becomes Q.
> f <- c("Cis female, she her", "F", "female", "Female", "Female cisgender",
+ "Female heterosexual", "Female/woman", "woman", "Woman")
> m <- c("m", "male", "Male", "Male/AMAB")
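If the hand-sorted vectors start getting long, it may help to lower-case and trim the responses before calling unique(), so that variants differing only in case or stray spaces collapse into one. A minimal sketch, assuming a hypothetical helper normalize() and made-up sample responses (neither is from the question):

```r
# Hypothetical helper: collapse case and surrounding whitespace
normalize <- function(x) trimws(tolower(x))

# Illustrative responses, not the real survey data
responses <- c("Female", "female", " F", "Woman", "woman", "Male", "male ")

# Fewer distinct values left to sort by hand
distinct <- sort(unique(normalize(responses)))
distinct
# [1] "f"      "female" "male"   "woman"
```

You would then build f and m from the normalized values and match against normalize(df$gender) rather than the raw column.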
Then just replace() three times. It is easy to read and more efficient than ifelse() or the like.
> df |>
+ transform(gender_new=replace(gender, gender %in% f, 'F')) |>
+ transform(gender_new=replace(gender_new, gender %in% m, 'M')) |>
+ transform(gender_new=replace(gender_new, !gender %in% c(f, m), 'Q'))
gender x gender_new
1 Woman -0.09465904 F
2 Female/woman 0.63286260 F
3 Female/woman 0.63286260 F
4 Male/AMAB -1.78130843 M
5 woman -2.65645542 F
6 Demiboy -1.38886070 Q
7 Female 0.40426832 F
8 Female/woman 0.63286260 F
9 Female -0.56469817 F
10 woman -2.65645542 F
11 female 0.36312841 F
12 m -0.28425292 M
13 my gender doesn't exist -0.30663859 Q
14 woman -2.65645542 F
15 F -0.10612452 F
16 F -0.10612452 F
17 Female -0.56469817 F
18 Nonbinary 1.32011335 Q
19 female 0.36312841 F
20 Male/AMAB -1.78130843 M
21 my gender doesn't exist -0.30663859 Q
22 Female -0.56469817 F
23 F -0.10612452 F
24 Female cisgender -0.06271410 F
25 Woman -0.09465904 F
26 Female 0.40426832 F
27 Male 1.37095845 M
28 m -0.28425292 M
29 female 1.51152200 F
30 Female/woman 0.63286260 F
31 Demiboy -1.38886070 Q
32 Female cisgender -0.06271410 F
33 Cis female, she her 2.01842371 F
34 I identify as a trans woman! 2.28664539 Q
35 Nonbinary 1.32011335 Q
36 Cis female, she her 2.01842371 F
37 Female heterosexual 1.30486965 F
38 female 0.36312841 F
39 male 0.63595040 M
40 Female 0.40426832 F
41 Transwoman -0.27878877 Q
42 Female 0.40426832 F
43 Male/AMAB -1.78130843 M
44 Female -0.56469817 F
45 woman -2.65645542 F
46 m -0.28425292 M
47 woman -2.65645542 F
48 Female 0.40426832 F
49 Transwoman -0.27878877 Q
50 Woman -0.09465904 F
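To make sure nothing was recoded by accident, one option is to cross-tabulate the raw responses against the new codes and list what fell into Q. A small sketch on stand-in data (the six responses and the short f and m vectors here are illustrative, not the full lists above):

```r
# Stand-in data; `gender` and `gender_new` mirror the column names above
gender <- c("Woman", "m", "Demiboy", "Female", "Male/AMAB", "Nonbinary")
f <- c("Female", "Woman")
m <- c("m", "Male/AMAB")

# Same three replace() steps, on plain vectors
gender_new <- replace(gender, gender %in% f, "F")
gender_new <- replace(gender_new, gender %in% m, "M")
gender_new <- replace(gender_new, !gender %in% c(f, m), "Q")

# Each raw response should appear under exactly one code
table(gender, gender_new)

# Raw responses that ended up as "Q", for manual review
needs_review <- unique(gender[gender_new == "Q"])
needs_review
# [1] "Demiboy"   "Nonbinary"
```

Anything in needs_review that really belongs in f or m can be added to those vectors and the recode rerun.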
Data:
set.seed(42)
df <- data.frame(
  gender=GenderSex,  ## from OP
  x=rnorm(length(GenderSex))
)
df <- df[sample.int(nrow(df), 50, replace=TRUE), ] |> `rownames<-`(NULL)
Upvotes: 1
Reputation: 76575
Use an auxiliary function: a form of grepl vectorized over the patterns, followed by a reduction to one logical value per row (TRUE if any pattern matched).
Thanks to r2evans for the suggestion of having a default value.
library(dplyr)
f <- c("Female/woman", "female", "Female cisgender", "Female",
"Woman", "woman", "Women", "women", "f", "F" )
m <- c("male", "Cis Male", "Male", "m", "M", "ma,e=]]")
gq <- c("genderqueer", "nonbinary", "genderfluid")
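# Returns TRUE for each element of x that matches at least one pattern
fun <- function(x, pattern) {
  # Vectorize grepl over `pattern`: one column of matches per pattern
  Grepl <- Vectorize(grepl, "pattern")
  out <- Grepl(pattern, x)
  # A row is TRUE if any of its patterns matched
  rowSums(out) > 0L
}
df <- data.frame(GenderSex)
df |>
  mutate(GenderNew = case_when(
    fun(GenderSex, f) ~ "F",
    fun(GenderSex, m) ~ "M",
    fun(GenderSex, gq) ~ "Q",
    .default = GenderSex  # keep the original response (needs dplyr >= 1.1.0)
  ))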
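One thing to watch: grepl does substring matching, so short patterns such as "m" or "f" can also hit inside longer words. A quick check, with the helper repeated so the snippet runs on its own; the four test values and the word-boundary pattern are my own illustration, not part of the answer above:

```r
# Same helper as above, repeated for a self-contained example
fun <- function(x, pattern) {
  Grepl <- Vectorize(grepl, "pattern")
  out <- Grepl(pattern, x)   # one row per response, one column per pattern
  rowSums(out) > 0L          # TRUE if any pattern matched that response
}

x <- c("Male", "Demiboy", "Nonbinary", "m")

# The bare pattern "m" also matches the "m" inside "Demiboy"
substring_hit <- fun(x, c("m", "Male"))
substring_hit
# [1]  TRUE  TRUE FALSE  TRUE

# Assumption: a word-boundary pattern only matches a standalone "m"
standalone_hit <- grepl("\\bm\\b", x, ignore.case = TRUE, perl = TRUE)
standalone_hit
# [1] FALSE FALSE FALSE  TRUE
```

Whether substring or whole-word matching is the better fit depends on how the free-text responses should be grouped, so it is worth inspecting the result before relying on it.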
Upvotes: 2