user10389226

Reputation: 109

How to match the titles of a person using regular expressions

I want to use regular expressions to match the title in a person's name. I need an R snippet that creates a new column called "Female" and fills it with TRUE/FALSE values based on the text in the "Name" column. For example, if the title is "Miss" the value should be TRUE, and if there is no salutation it should be NA.

This is the data frame

df <- data.frame(PersonID=1:8, Name=c("Mr. Bob", "Ms. Blank", "Roger, Mr.", "MR Mark Simpson", "Miss Lisa", "Mrs. joshep", "Rakesh Kumar", "Kumar Gums Murphy"))

grepl("Miss", df, perl=TRUE)

output:

FALSE,FALSE,FALSE

expected output:

FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,NA,NA

Can anyone please help me with this?

Upvotes: 2

Views: 139

Answers (1)

JustGettinStarted

Reputation: 834

If you want NA for names with no title, you first have to rule out every other designation. Just because "Miss" is absent doesn't mean that "Mr" or "MISS" is absent too.
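To illustrate the point, here is a minimal sketch using the data frame from the question. It assumes nothing beyond base R: grepl() is case-sensitive by default, and it should be given the Name column (df$Name) rather than the whole data frame so that it returns one value per row.

```r
# Example data from the question
df <- data.frame(PersonID = 1:8,
                 Name = c("Mr. Bob", "Ms. Blank", "Roger, Mr.", "MR Mark Simpson",
                          "Miss Lisa", "Mrs. joshep", "Rakesh Kumar", "Kumar Gums Murphy"),
                 stringsAsFactors = FALSE)

# Search the Name column, one result per row
miss_hit <- grepl("Miss", df$Name, perl = TRUE)

# The match is case-sensitive unless ignore.case = TRUE is set
upper_default <- grepl("Miss", "MISS Lisa")
upper_nocase  <- grepl("Miss", "MISS Lisa", ignore.case = TRUE)
```

A FALSE in miss_hit only says that "Miss" was not found; it does not distinguish "Mr. Bob" from "Rakesh Kumar", which is why the other titles have to be checked as well.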

The following assigns "M", "F" or NA for your example. Add more designations as needed.

Titles <- c("Miss", "Ms","Mr","Mrs","MR","MS","MRS","MISS") # vector of possible titles
f.Titles <- c("Miss", "Ms","Mrs","MS","MRS","MISS") # vector of female specific titles
check <- NULL
for (i in seq_along(Titles)) {
  # one logical column per title: TRUE where that title occurs in the name
  check <- cbind(check, grepl(Titles[i], df$Name, perl=TRUE))
}

colnames(check) <- Titles
apply(check,1,function(x)ifelse(!any(x),NA,
                                ifelse(any(names(which(x)) %in% f.Titles),"F","M")))

Output :

[1] "M" "F" "M" "M" "F" "F" NA  NA 
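The title vectors above list every spelling by hand and match plain substrings, so a name such as "Mrinal" would count as "Mr". As a hedged alternative (not part of the original answer), the sketch below uses two patterns with word boundaries (\\b) and ignore.case = TRUE. The names any_title, female_title, has_title, is_female and gender are made up for this illustration.

```r
# Example data from the question
df <- data.frame(PersonID = 1:8,
                 Name = c("Mr. Bob", "Ms. Blank", "Roger, Mr.", "MR Mark Simpson",
                          "Miss Lisa", "Mrs. joshep", "Rakesh Kumar", "Kumar Gums Murphy"),
                 stringsAsFactors = FALSE)

# \\b is a word boundary, so "mr" cannot match inside "Mrs" or "Mrinal";
# ignore.case = TRUE covers "MR", "Mr" and "mr" without listing each spelling
any_title    <- "\\b(mr|mrs|ms|miss)\\b"
female_title <- "\\b(mrs|ms|miss)\\b"

has_title <- grepl(any_title, df$Name, ignore.case = TRUE, perl = TRUE)
is_female <- grepl(female_title, df$Name, ignore.case = TRUE, perl = TRUE)

# NA when no title was found, otherwise "F" or "M"
gender <- ifelse(!has_title, NA, ifelse(is_female, "F", "M"))
```

On the example data this gives the same "M"/"F"/NA vector as the loop version. The list of titles in the two patterns is still an assumption and needs extending for other data (for example "Dr" carries no gender information).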

From there it's a simple step to build the column:

G <- apply(check,1,function(x)ifelse(!any(x),NA,
                                     ifelse(any(names(which(x)) %in% f.Titles),"F","M")))

df$Female <- ifelse(G=="F",TRUE,ifelse(is.na(G),NA,FALSE))
df
  PersonID              Name Female
1        1           Mr. Bob  FALSE
2        2         Ms. Blank   TRUE
3        3        Roger, Mr.  FALSE
4        4   MR Mark Simpson  FALSE
5        5         Miss Lisa   TRUE
6        6       Mrs. joshep   TRUE
7        7      Rakesh Kumar     NA
8        8 Kumar Gums Murphy     NA

Edit 1 :

Here is a more efficient version that builds the Female column directly. You still need to specify all possible titles (Titles) and the female titles (f.Titles).

# one column per title, one row per name
check <- apply(as.matrix(Titles), 1, function(x) grepl(x, df$Name, perl=TRUE))
colnames(check) <- Titles
# NA if no title matched, otherwise TRUE when any matched title is a female one
df$Female <- apply(check, 1, function(x) ifelse(!any(x), NA, any(names(which(x)) %in% f.Titles)))

Upvotes: 1
