Reputation: 69
I have a dataset of people's full names that I need to split into first, last, and other names. I was going to use tidyr::separate, but the strings vary in length: some people have only two names (first and last), some have three or more, and some have compound surnames (e.g. the Dutch 'van Gogh') or names followed by titles ('Brown CFA'). Any ideas on how to approach this?
Extract:
> database$Full_Name[15:25]
 [1] "David Regan"                  "Izaque Iuzuru Nagata"
 [3] "Christian Schmit de la Breli" "Peter Doyle"
 [5] "Hans R.Bruetsch"              "Marcus Reichel"
 [7] "Per-Axel Koch"                "Louis Van der Walt"
 [9] "Mario Adamek"                 "Ugur Tozsekerli"
[11] "Judit Ludvai"
Upvotes: 0
Views: 124
Reputation: 78792
If you're willing to go with a brand-spankin'-new, V8-infused GitHub package, this should work:
# YOU WILL NEED TO DO THIS FIRST!
# devtools::install_github("hrbrmstr/humanparser")
library(humanparser)
parse_name("John Smith Jr.")
## $firstName
## [1] "John"
##
## $suffix
## [1] "Jr."
##
## $lastName
## [1] "Smith"
##
## $fullName
## [1] "John Smith Jr."
full_names <- c("David Regan", "Izaque Iuzuru Nagata",
"Christian Schmit de la Breli", "Peter Doyle", "Hans R.Bruetsch",
"Marcus Reichel", "Per-Axel Koch", "Louis Van der Walt",
"Mario Adamek", "Ugur Tozsekerli", "Judit Ludvai" )
parse_names(full_names)
## Source: local data frame [11 x 4]
##
##    firstName lastName     fullName                     middleName
## 1  David     Regan        David Regan                  NA
## 2  Izaque    Nagata       Izaque Iuzuru Nagata         Iuzuru
## 3  Christian de la Breli  Christian Schmit de la Breli Schmit
## 4  Peter     Doyle        Peter Doyle                  NA
## 5  Hans      R.Bruetsch   Hans R.Bruetsch              NA
## 6  Marcus    Reichel      Marcus Reichel               NA
## 7  Per-Axel  Koch         Per-Axel Koch                NA
## 8  Louis     Van der Walt Louis Van der Walt           NA
## 9  Mario     Adamek       Mario Adamek                 NA
## 10 Ugur      Tozsekerli   Ugur Tozsekerli              NA
## 11 Judit     Ludvai       Judit Ludvai                 NA
It's based on this Node.js module and uses the V8 package in the background to do all the dirty work with the parseName function from that module (yes, it's R calling JavaScript to Get 'er Done). Someone should really port that code to R at some point, though, since Python has a similar module.
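Until someone does that port, here is a rough base-R sketch of the general idea, in case pulling in V8 is not an option. To be clear, this is not the humanparser algorithm: `split_name` is a hypothetical helper, and its `particles` vector is an assumed, incomplete list of surname prefixes. It takes the first token as the first name, treats the last token plus any particles directly before it as the last name, and puts whatever is left in the middle. It does not handle suffixes or titles such as "Jr." or "CFA".

```r
# Hypothetical helper: a naive splitter, NOT a port of the humanparser logic.
# `particles` is an assumed, incomplete list of surname prefixes.
split_name <- function(full_name,
                       particles = c("van", "von", "der", "den",
                                     "de", "la", "du", "di")) {
  tokens <- strsplit(trimws(full_name), "\\s+")[[1]]
  n <- length(tokens)

  # Zero or one token: nothing to split, so only the first name is filled.
  if (n < 2) {
    return(data.frame(first = tokens[1], middle = NA_character_,
                      last = NA_character_, stringsAsFactors = FALSE))
  }

  # Start with the final token as the surname, then walk left while the
  # preceding token is a known particle (never consuming the first token).
  start <- n
  while (start > 2 && tolower(tokens[start - 1]) %in% particles) {
    start <- start - 1
  }

  # Anything between the first name and the surname becomes the middle name.
  middle <- if (start > 2) {
    paste(tokens[2:(start - 1)], collapse = " ")
  } else {
    NA_character_
  }

  data.frame(first = tokens[1],
             middle = middle,
             last = paste(tokens[start:n], collapse = " "),
             stringsAsFactors = FALSE)
}

full_names <- c("David Regan", "Christian Schmit de la Breli",
                "Louis Van der Walt")
parsed <- do.call(rbind, lapply(full_names, split_name))
```

A fixed particle list will misfire on names it was not written for, so treat this as a starting point and spot-check the results against your own data.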
Upvotes: 1