Reputation: 25
I have a list of names in mixed formats, which I would like to separate into columns containing first and last names in R. An example dataset:
Names <- c("Mary Smith","Hernandez, Maria","Bonds, Ed","Michael Jones")
The goal is to assemble a dataframe that contains names in a format like this:
FirstNames <- c("Mary","Maria","Ed","Michael")
LastNames <- c("Smith","Hernandez","Bonds","Jones")
FinalData <- data.frame(FirstNames, LastNames)
I tried a few approaches that select either the first or the last name depending on whether the two parts are separated by a space alone or by a comma and a space. For instance, I used a regular expression in gsub to extract the first names from rows in which a comma and a space separate the names:
FirstNames2 <- gsub(".*,\\s", "", Names)
This worked for rows in LastName, FirstName format, but for rows in FirstName LastName format the pattern does not match, so gsub returned the entire string unchanged.
How would you tackle this problem? Thanks to all in advance!
Upvotes: 1
Views: 380
Reputation: 269674
Here is a one-liner. The pattern first tries to match "Firstname Lastname" and, if that fails, tries "Lastname, Firstname". No packages are used.
read.table(text = sub("(\\w+) (\\w+)|(\\w+), (\\w+)", "\\1\\4 \\2\\3", Names), as.is=TRUE)
giving:
       V1        V2
1    Mary     Smith
2   Maria Hernandez
3      Ed     Bonds
4 Michael     Jones
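To make the role of each capture group easier to see, here is a minimal sketch that applies the same pattern twice, once per column. The names `pattern`, `first`, and `last` are introduced here for illustration and are not part of the one-liner above.

```r
Names <- c("Mary Smith", "Hernandez, Maria", "Bonds, Ed", "Michael Jones")

# Groups 1 and 2 capture "First Last"; groups 3 and 4 capture "Last, First".
# Only one alternative matches a given string, and the groups of the other
# alternative are substituted as empty strings, so "\\1\\4" is always the
# first name and "\\2\\3" is always the last name.
pattern <- "(\\w+) (\\w+)|(\\w+), (\\w+)"

first <- sub(pattern, "\\1\\4", Names)
last  <- sub(pattern, "\\2\\3", Names)

FinalData <- data.frame(FirstNames = first, LastNames = last,
                        stringsAsFactors = FALSE)
```

Note that `\\w+` only covers letters, digits, and underscores, so names with hyphens, apostrophes, or middle names would need a wider pattern.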
Upvotes: 4
Reputation: 32548
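# Split each name on either ", " or a single space
temp = strsplit(x = Names, split = "(, | )")
do.call(rbind, lapply(seq_along(temp), function(i){
    if (grepl(pattern = ", ", x = Names[i])){
        # "Last, First": the second piece is the first name
        data.frame(F = temp[[i]][2], L = temp[[i]][1])
    }else{
        # "First Last": the pieces are already in order
        data.frame(F = temp[[i]][1], L = temp[[i]][2])
    }
}))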
# F L
#1 Mary Smith
#2 Maria Hernandez
#3 Ed Bonds
#4 Michael Jones
Upvotes: 0
Reputation: 11762
You could rearrange the comma-separated names into "first last" order and then just use strsplit.
FirstNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 1)
LastNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 2)
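If the data frame from the question is the end goal, the rearrangement can be done once and reused for both columns. This is a sketch of that idea; `swapped` and `parts` are names introduced here for illustration, and it assumes every name consists of exactly two words.

```r
Names <- c("Mary Smith", "Hernandez, Maria", "Bonds, Ed", "Michael Jones")

# Rewrite "Last, First" as "First Last"; strings without a comma are unchanged
swapped <- sub("(\\w+), (\\w+)", "\\2 \\1", Names)

# Split each rearranged name on the single space
parts <- strsplit(swapped, " ", fixed = TRUE)

# vapply checks that each extracted piece is a single string
FinalData <- data.frame(
  FirstNames = vapply(parts, `[[`, character(1), 1),
  LastNames  = vapply(parts, `[[`, character(1), 2),
  stringsAsFactors = FALSE
)
```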
Upvotes: 1