gh0strider18
gh0strider18

Reputation: 175

R: how to match partial string in one dataframe from another, and based on condition, assign a value?

OK - just as the title states, I'm trying to assign a value (in this case, male or female) to a column based on a partial match in another dataframe column.

I have a large database with customer names (db$customer) and a dataframe with 2000 of the most common boys and girls names (gender$name) and the gender (gender$gender). I'd like to, using a partial match, so something like

grepl()

,assign a value to the database dataframe (db$gender)

here is some quick reproducible code:

customer <- c("Smith, Mary", "Johnson, Jack", "Doe, Jane", "Bronson, Charles", "Huckstable,     Cliff", "Simpson, Jessica", "Simpson, Marge", "Mathews, Dave")
db <- as.data.frame(customer)
name <- c("Jack", "Charles", "Cliff", "Dave", "Mary", "Jane", "Jessica", "Marge")
gender <- c("M", "M", "M", "M", "F", "F", "F", "F")
gender <- as.data.frame(cbind(name, gender))

Again, I'm looking to append the db file with a gender column indicating whether the customer name is a boy or girl's name.

Thank you so much for any all help! P.S. I searched the knowledge base, and maybe it's because I don't know what to really look for, could not find any existing help threads.

Thanks again!

Upvotes: 0

Views: 834

Answers (3)

lawyeR
lawyeR

Reputation: 7654

Here is another direction, but it is incomplete. It uses the name2sex function of the qdap package. However, I am not sure how to satisfy the database argument.

require(qdap)
db$customer <- as.character(db$customer)
first.names <- sapply(strsplit(db$customer, split = ","), "[", 2) # to get the first name only
first.names <- unlist(strip(first.names)) # to trim leading and trailing spaces, etc.
name2sex(names.list = first.names, database = qdapDictionaries::NAMES_SEX) # needs more

Upvotes: 1

Jota
Jota

Reputation: 17611

Since the first names in gender and db are exact matches, we can use merge after splitting the customer names into first and last names.

db2 <- data.frame(do.call(rbind, strsplit(as.character(db$customer), ", +")))

merge(db2, gender, by.x="X2", by.y="name", all.x=TRUE)

       X2         X1 gender
1 Charles    Bronson      M
2   Cliff Huckstable      M
3    Dave    Mathews      M
4    Jack    Johnson      M
5    Jane        Doe      F
6 Jessica    Simpson      F
7   Marge    Simpson      F
8    Mary      Smith      F

I have the "+" in ", +" in the strsplit call because of the multiple spaces in "Huckstable, Cliff". It is otherwise unnecessary.

Upvotes: 0

ChrKoenig
ChrKoenig

Reputation: 973

This should do what you want.

db$gender <- sapply(db$customer,FUN=function(x){
  for(i in 1:nrow(gender)){
    if (grepl(gender$name[i],x)){
      return(gender$gender[i])
    } 
  }
  return(NA)
})

Upvotes: 1

Related Questions