bio_r_learner

Reputation: 25

Parsing names in mixed formats using R

I have a list of names in mixed formats, which I would like to separate into columns containing first and last names in R. An example dataset:

Names <- c("Mary Smith","Hernandez, Maria","Bonds, Ed","Michael Jones")

The goal is to assemble a dataframe that contains names in a format like this:

FirstNames <- c("Mary","Maria","Ed","Michael")
LastNames <- c("Smith","Hernandez","Bonds","Jones")
FinalData <- data.frame (FirstNames, LastNames)

I tried a few approaches to select either the first or last name based on whether the names are separated by a space only or by a comma and a space. For instance, I tried using a regular expression in gsub to extract first names from rows in which a comma-space separates the names:

FirstNames2 <- gsub (".*,\\s","",Names)

This worked for rows in LastName, FirstName format, but for rows in FirstName LastName format the pattern did not match, so gsub returned the entire string unchanged.

How would you tackle this problem? Thanks to all in advance!

Upvotes: 1

Views: 380

Answers (3)

G. Grothendieck

Reputation: 269674

Here is a one-liner. The pattern first tries to match Firstname Lastname and, if that fails, tries Lastname, Firstname. No packages are used.

read.table(text = sub("(\\w+) (\\w+)|(\\w+), (\\w+)", "\\1\\4 \\2\\3", Names), as.is=TRUE)

giving:

       V1        V2
1    Mary     Smith
2   Maria Hernandez
3      Ed     Bonds
4 Michael     Jones
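
To see why the replacement works, it can help to look at the intermediate strings that sub produces before read.table splits them. The following is a minimal sketch, assuming the Names vector from the question; the intermediate variable swapped and the col.names values are additions for illustration (chosen to match the question's desired output) and are not part of the one-liner above.

```r
Names <- c("Mary Smith", "Hernandez, Maria", "Bonds, Ed", "Michael Jones")

# Groups 1 and 2 capture "First Last"; groups 3 and 4 capture "Last, First".
# Groups from the alternative that did not match are substituted as empty
# strings, so "\\1\\4" is always the first name and "\\2\\3" the last name.
swapped <- sub("(\\w+) (\\w+)|(\\w+), (\\w+)", "\\1\\4 \\2\\3", Names)
print(swapped)

# read.table splits on whitespace; col.names is an illustrative addition.
FinalData <- read.table(text = swapped, as.is = TRUE,
                        col.names = c("FirstNames", "LastNames"))
print(FinalData)
```

Note that \\w+ matches only letters, digits, and underscores, so names containing hyphens, apostrophes, or more than two words would need a different pattern.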

Upvotes: 4

d.b

Reputation: 32548

# Split each name on either ", " or a single space
temp = strsplit(x = Names, split = "(, | )")
do.call(rbind, lapply(seq_along(temp), function(i){
    # A comma means "Last, First", so the two pieces are swapped
    if (grepl(pattern = ", ", x = Names[i])){
        data.frame(F = temp[[i]][2], L = temp[[i]][1])
    }else{
        data.frame(F = temp[[i]][1], L = temp[[i]][2])
    }
}))
#        F         L
#1    Mary     Smith
#2   Maria Hernandez
#3      Ed     Bonds
#4 Michael     Jones

Upvotes: 0

drmariod

Reputation: 11762

You could rearrange the comma-separated names into first-last order and then just use strsplit.

FirstNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 1)
LastNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 2)
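
As a sketch of how these two pieces could be assembled into the data frame the question asks for, the rearranged strings can be computed once and reused. This assumes the Names vector from the question; the intermediate names normalized and parts are introduced here for illustration only.

```r
Names <- c("Mary Smith", "Hernandez, Maria", "Bonds, Ed", "Michael Jones")

# Rewrite "Last, First" as "First Last"; strings without a comma
# do not match the pattern and pass through unchanged.
normalized <- gsub("(\\w+), (\\w+)", "\\2 \\1", Names)

# Split each normalized name on the single space.
parts <- strsplit(normalized, " ", fixed = TRUE)

# vapply checks that each extracted element is a single string.
FinalData <- data.frame(
  FirstNames = vapply(parts, `[[`, character(1), 1),
  LastNames  = vapply(parts, `[[`, character(1), 2),
  stringsAsFactors = FALSE
)
print(FinalData)
```

This relies on every name having exactly two parts; a name with a middle name or a missing part would need extra handling.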

Upvotes: 1
