Absolute_Human

Reputation: 23

How to identify gender given a character string?

I have a dataframe such that...

df_example<- data.frame(
Name_example = c("John L. Smith", "Mary C. Boregart", "Alphusia D. Oregeno", "Mike A. Doo-doo-butt"),
Profile_example = c("Mr. Smith advises U.S. clients in complying with finger-paint laws. Mr. Smith received his J.D. from fake law school.", 
"Her experience includes identifying green and blue trains", 
"She helps investors seek out the boogeyman. Alphusia also knows karate.", 
"Mike specializes on the full spectrum of sandcastles"))

I would like to make a new column identifying the gender of each name based on the information in the profile.

I have attempted something like this...

ifelse(str_detect(df_example$Profile_example, "she|She|her|Her"), gender<-"F", gender<-"M" )

This only saves the entry for the first name ("M") instead of all four.

How would you go about solving this problem? What if you had 100s of names and each profile was several paragraphs in length?

Upvotes: 1

Views: 347

Answers (2)

Edward

Reputation: 18798

One option, if the profile doesn't include a pronoun (he/she, him/her, etc.), is to use the babynames dataset from the package of the same name. The dataset contains US baby names together with the sex and, for each year, the proportion of babies of that sex who were given that name.

First, extract the first name of each person.

df_example$name <- sub("([A-Za-z]+).*", "\\1", df_example$Name_example)

Then summarise babynames so that each name appears only once, paired with its most common sex if the name is shared by both sexes.

library(babynames)
library(dplyr)

Babies <- babynames %>%
  group_by(sex, name) %>%
  summarise(prop=mean(prop)) %>%  # Mean over all years
  group_by(name) %>%
  arrange(-prop) %>%  # need to arrange by prop descending 
  slice(1) %>% select(-prop)  # and then select the most common sex for each name

The code above keeps a single row per name. For names given to both sexes (Riley, Jordan, etc.), it keeps the sex with the higher average proportion.

Then we left join the example dataset with this summarised baby names data, so that names missing from babynames are kept with an NA sex.

library(dplyr)
left_join(df_example, Babies, by="name")
      name  sex
1     John    M
2     Mary    F
3 Alphusia <NA>
4     Mike    M

We see that the sex of "Mike" is male but "Alphusia" is too uncommon to ascertain.
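Since the name lookup is meant for profiles without a pronoun, the two sources can be combined: take the pronoun-based guess where there is one and fall back to the name otherwise. The following is only a sketch in base R. The `from_profile` values and the tiny `lookup` vector are hypothetical stand-ins (for a pronoun search and for the summarised `Babies` table, respectively), not output from the code above.

```r
# Hypothetical pronoun-based guesses; NA where the profile gave no clue
from_profile <- c("M", "F", "F", NA)

# Hypothetical stand-in for the summarised babynames table (name -> sex)
lookup <- c(John = "M", Mary = "F", Mike = "M")

# Same first-name extraction as above
full_names <- c("John L. Smith", "Mary C. Boregart",
                "Alphusia D. Oregeno", "Mike A. Doo-doo-butt")
first_name <- sub("([A-Za-z]+).*", "\\1", full_names)

# Names missing from the lookup (here "Alphusia") come back as NA
from_name <- unname(lookup[first_name])

# Prefer the profile-based guess, fall back to the name-based one
gender <- ifelse(is.na(from_profile), from_name, from_profile)
gender
```

With the real data, `from_name` would come from the `sex` column returned by the join rather than from a hand-written vector.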

Upvotes: 1

Chris

Reputation: 535

Using grepl in combination with ifelse should do the job:

gender <- ifelse(grepl("she|She|her|Her", df_example$Profile_example), "F", "M" )

Performance should be okay even with very large datasets.
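One caveat with the pattern above is that it matches substrings, so a word such as "other" or "there" would count as "her", and every profile without a match defaults to "M". A possible refinement, sketched below under my own assumptions (the pronoun lists and the choice to return NA for ambiguous or pronoun-free profiles are not part of the original answer), is to match whole words only and to check for male markers as well.

```r
# Profiles from the question
profiles <- c(
  "Mr. Smith advises U.S. clients in complying with finger-paint laws. Mr. Smith received his J.D. from fake law school.",
  "Her experience includes identifying green and blue trains",
  "She helps investors seek out the boogeyman. Alphusia also knows karate.",
  "Mike specializes on the full spectrum of sandcastles"
)

# \\b is a word boundary, so "other" or "there" no longer match "her"
has_female <- grepl("\\b(she|her|hers|herself)\\b", profiles,
                    ignore.case = TRUE, perl = TRUE)
has_male   <- grepl("\\b(he|him|his|himself|mr)\\b", profiles,
                    ignore.case = TRUE, perl = TRUE)

# Only label a profile when exactly one kind of marker is present
gender <- ifelse(has_female & !has_male, "F",
          ifelse(has_male & !has_female, "M", NA_character_))
gender
```

Here the fourth profile ("Mike ...") contains no marker at all, so it is left as NA instead of being labelled "M" by default.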

Upvotes: 1
