Pavel

Reputation: 69

Splitting a complex string in R

I have a dataset that contains people's full names, and I need to split each one into first, last, and other names. I was about to use tidyr::separate, but the names have varying numbers of parts: some people have only two names (first and last), some have three or more, and some have compound surnames (e.g. the Dutch 'van Gogh') or trailing titles ('Brown CFA'). Any ideas how to go about that?

Extract:

> database$Full_Name[15:25]
 [1] "David Regan"                  "Izaque Iuzuru Nagata"        
 [3] "Christian Schmit de la Breli" "Peter Doyle"                 
 [5] "Hans R.Bruetsch"              "Marcus Reichel"              
 [7] "Per-Axel Koch"                "Louis Van der Walt"          
 [9] "Mario Adamek"                 "Ugur Tozsekerli"             
[11] "Judit Ludvai"  

Upvotes: 0

Views: 124

Answers (2)

Pavel

Reputation: 69

install.packages("humanparser")

This one works best! Thanks, Oliver!

Upvotes: 1

hrbrmstr

Reputation: 78792

If you're willing to go with a brand-spankin'-new, V8-infused GitHub package, this should work:

# YOU WILL NEED TO DO THIS FIRST!
# devtools::install_github("hrbrmstr/humanparser")

library(humanparser)

parse_name("John Smith Jr.")

## $firstName
## [1] "John"
## 
## $suffix
## [1] "Jr."
## 
## $lastName
## [1] "Smith"
## 
## $fullName
## [1] "John Smith Jr."


full_names <- c("David Regan", "Izaque Iuzuru Nagata",
                "Christian Schmit de la Breli", "Peter Doyle", "Hans R.Bruetsch",
                "Marcus Reichel", "Per-Axel Koch", "Louis Van der Walt",
                "Mario Adamek", "Ugur Tozsekerli", "Judit Ludvai" )


parse_names(full_names)

## Source: local data frame [11 x 4]
## 
##    firstName     lastName                     fullName middleName
## 1      David        Regan                  David Regan         NA
## 2     Izaque       Nagata         Izaque Iuzuru Nagata     Iuzuru
## 3  Christian  de la Breli Christian Schmit de la Breli     Schmit
## 4      Peter        Doyle                  Peter Doyle         NA
## 5       Hans   R.Bruetsch              Hans R.Bruetsch         NA
## 6     Marcus      Reichel               Marcus Reichel         NA
## 7   Per-Axel         Koch                Per-Axel Koch         NA
## 8      Louis Van der Walt           Louis Van der Walt         NA
## 9      Mario       Adamek                 Mario Adamek         NA
## 10      Ugur   Tozsekerli              Ugur Tozsekerli         NA
## 11     Judit       Ludvai                 Judit Ludvai         NA

It's based on this Node.js module and uses the V8 package in the background to do all the dirty work via that module's parseName function (yes, it's R calling JavaScript to Get 'er Done). Someone should really port that code to R at some point, though; Python already has a similar module.
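In the meantime, if pulling in V8 is not an option, a rough base-R approximation might look like the sketch below. To be clear, this is not a port of the JavaScript module and not part of humanparser: `split_name` is a hypothetical helper, and the particle and suffix lists are illustrative assumptions that you would need to extend for your own data. The idea is simply to peel off a known trailing suffix, then let the last name grow leftwards over particles such as "van", "der", "de", or "la".

```r
# Hypothetical helper (not part of humanparser): a rough base-R name splitter.
# The default particle and suffix lists are illustrative, not exhaustive.
split_name <- function(full_name,
                       particles = c("van", "von", "der", "den", "de",
                                     "la", "le", "di", "da"),
                       suffixes  = c("jr", "jr.", "sr", "sr.", "ii", "iii",
                                     "cfa", "phd")) {
  tokens <- strsplit(trimws(full_name), "\\s+")[[1]]

  # Peel off a trailing suffix or title such as "Jr." or "CFA".
  suffix <- NA_character_
  if (length(tokens) > 2 && tolower(tokens[length(tokens)]) %in% suffixes) {
    suffix <- tokens[length(tokens)]
    tokens <- tokens[-length(tokens)]
  }

  n      <- length(tokens)
  first  <- tokens[1]
  middle <- NA_character_
  last   <- NA_character_

  if (n >= 2) {
    # The last name starts at the final token and extends left over particles,
    # but never swallows the first token.
    start <- n
    while (start > 2 && tolower(tokens[start - 1]) %in% particles) {
      start <- start - 1
    }
    last <- paste(tokens[start:n], collapse = " ")
    # Anything between the first name and the last name is treated as middle.
    if (start > 2) {
      middle <- paste(tokens[2:(start - 1)], collapse = " ")
    }
  }

  data.frame(first = first, middle = middle, last = last, suffix = suffix,
             stringsAsFactors = FALSE)
}

# Apply to a character vector and bind the rows into one data frame.
split_names <- function(full_names, ...) {
  do.call(rbind, lapply(full_names, split_name, ...))
}
```

Calling `split_names(full_names)` on the sample vector above gives "de la Breli" and "Van der Walt" as last names, the same as the package output, but anything outside the hard-coded lists (other particles, other titles, "Last, First" ordering) will be split naively, so treat it as a starting point only.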

Upvotes: 1
