Reputation: 175
OK - just as the title states, I'm trying to assign a value (in this case, male or female) to a column based on a partial match in another dataframe column.
I have a large database with customer names (db$customer) and a dataframe with 2000 of the most common boys and girls names (gender$name) and the gender (gender$gender). I'd like to, using a partial match, so something like
grepl()
,assign a value to the database dataframe (db$gender)
here is some quick reproducible code:
customer <- c("Smith, Mary", "Johnson, Jack", "Doe, Jane", "Bronson, Charles", "Huckstable, Cliff", "Simpson, Jessica", "Simpson, Marge", "Mathews, Dave")
db <- as.data.frame(customer)
name <- c("Jack", "Charles", "Cliff", "Dave", "Mary", "Jane", "Jessica", "Marge")
gender <- c("M", "M", "M", "M", "F", "F", "F", "F")
gender <- as.data.frame(cbind(name, gender))
Again, I'm looking to append the db file with a gender column indicating whether the customer name is a boy or girl's name.
Thank you so much for any all help! P.S. I searched the knowledge base, and maybe it's because I don't know what to really look for, could not find any existing help threads.
Thanks again!
Upvotes: 0
Views: 834
Reputation: 7654
Here is another direction, but it is incomplete. It uses the name2sex
function of the qdap
package. However, I am not sure how to satisfy the database argument.
require(qdap)
db$customer <- as.character(db$customer)
first.names <- sapply(strsplit(db$customer, split = ","), "[", 2) # to get the first name only
first.names <- unlist(strip(first.names)) # to trim leading and trailing spaces, etc.
name2sex(names.list = first.names, database = qdapDictionaries::NAMES_SEX) # needs more
Upvotes: 1
Reputation: 17611
Since the first names in gender
and db
are exact matches, we can use merge
after splitting the customer names into first and last names.
db2 <- data.frame(do.call(rbind, strsplit(as.character(db$customer), ", +")))
merge(db2, gender, by.x="X2", by.y="name", all.x=TRUE)
X2 X1 gender
1 Charles Bronson M
2 Cliff Huckstable M
3 Dave Mathews M
4 Jack Johnson M
5 Jane Doe F
6 Jessica Simpson F
7 Marge Simpson F
8 Mary Smith F
I have the "+" in ", +"
in the strsplit
call because of the multiple spaces in
"Huckstable, Cliff"
. It is otherwise unnecessary.
Upvotes: 0
Reputation: 973
This should do what you want.
db$gender <- sapply(db$customer,FUN=function(x){
for(i in 1:nrow(gender)){
if (grepl(gender$name[i],x)){
return(gender$gender[i])
}
}
return(NA)
})
Upvotes: 1