Reputation: 1364
I have a FULLNAME column that I want to split into three columns: LASTNAME, FIRSTNAME, and MIDDLE_NAME_INITIAL. The different cases are covered in the example below. I think it is easier to look at my code than at my description.
df <- data.frame(FULLNAME = c("John, Smith J.",
"David, Cameron",
"Adam-Steve, Johnson M.",
"Antonio, Zang-Chi K",
"Joan Philippe, Luis Carlos",
"Dave, Jr., Danny Rock",
"Jake, Joan-Anberto",
"Annie, L.K Selena",
"Anna, P. Zhei"))
LASTNAME FIRSTNAME MIDDLE_NAME_INITIAL
1 John Smith J.
2 David Cameron
3 Adam-Steve Johnson M.
4 Antonio Zang-Chi K
5 Joan Philippe Luis Carlos
6 Dave, Jr. Danny Rock
7 Jake Joan-Anberto
8 Annie Selena L.K
9 Anna Zhei P.
I googled the problem and found this here.
I tried different approaches, one of which is pattern = "(.+),\\s*(.+)\\s+(.+)", but it failed to produce the expected output.
Any recommendation would be appreciated.
Upvotes: 1
Views: 435
Reputation: 1216
Requires PCRE-style regular expression support. So, yeah...
/
^                                             # start at the beginning of the string
(
  \w+                                         # first name
  (?:[- ]\w+)*                                # optional second part of first name
  (?:,(?![^,]*$)\s[\w.]+)?                    # optional comma-separated addendum to 1st name
)
,\s                                           # delimiting comma and space
(?=                                           # assert existence of last name
  .*?                                         # bridge gap to last name (pre-initials)
  (\w{2,}(?:-\w{2,})*)                        # (optionally multi-part) last name
)
(?=                                           # assert existence of optional initials
  (?:.*?\b(\w\.\w\b|\w\b\.?|(?<!-)\w+$))?     # optional initials or middle name
)
/x                                            # flag: enable free-spacing mode for expression
See demo.
I have no idea about R; this is just an example of how to capture the different name parts, so far as possible.
Edit: updated the expression to treat additional name parts like middle name initials.
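Since the answer stops short of R, here is a minimal sketch of how the expression above might be applied there. It assumes `regexpr()` with `perl = TRUE`, which exposes the capture groups through the `capture.start` and `capture.length` attributes; the names `full_names`, `capture` and `parts` are made up for illustration, and the columns are left as neutral group numbers because the answer and the question label the name parts differently.

```r
full_names <- c("John, Smith J.", "David, Cameron", "Adam-Steve, Johnson M.",
                "Antonio, Zang-Chi K", "Joan Philippe, Luis Carlos",
                "Dave, Jr., Danny Rock", "Jake, Joan-Anberto",
                "Annie, L.K Selena", "Anna, P. Zhei")

# The expression from the answer without free-spacing mode;
# backslashes are doubled because this is an R string.
pattern <- paste0(
  "^(\\w+(?:[- ]\\w+)*(?:,(?![^,]*$)\\s[\\w.]+)?)",        # group 1: part before the delimiting comma
  ",\\s",                                                   # delimiting comma and space
  "(?=.*?(\\w{2,}(?:-\\w{2,})*))",                          # group 2: first run of 2+ word characters
  "(?=(?:.*?\\b(\\w\\.\\w\\b|\\w\\b\\.?|(?<!-)\\w+$))?)"    # group 3: initials or trailing name
)

m      <- regexpr(pattern, full_names, perl = TRUE)
starts <- attr(m, "capture.start")
lens   <- attr(m, "capture.length")

# Hypothetical helper: text of capture group i, "" where the group did not take part
capture <- function(i) {
  ifelse(starts[, i] > 0,
         substring(full_names, starts[, i], starts[, i] + lens[, i] - 1),
         "")
}

parts <- data.frame(group1 = capture(1),
                    group2 = capture(2),
                    group3 = capture(3),
                    stringsAsFactors = FALSE)
print(parts)
```

One thing worth checking on your own data: for an entry with a single name after the comma, such as "David, Cameron", the last alternative of group 3 appears to capture that same name again, so some post-processing would still be needed.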
Upvotes: 2
Reputation: 1892
Try this expression:
([\w\s.,-]+)(?:[^,]*,\s){1,}([\w.-]+)\s*([\w.-]*)
Here you can see how it works: https://regexr.com/50oef
I don't know the R language, so let me show an example in Java:
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitNames {
    public static void main(String[] args) {
        List<String> items = Arrays.asList(
                "John, Smith J.",
                "David, Cameron",
                "Adam-Steve, Johnson M.",
                "Antonio, Zang-Chi K",
                "Joan Philippe, Luis Carlos",
                "Dave, Jr., Danny Rock",
                "Jake, Joan-Anberto",
                "Annie, L.K Selena",
                "Anna, P. Zhei");
        Pattern regex = Pattern.compile("([\\w\\s.,-]+)(?:[^,]*,\\s){1,}([\\w.-]+)\\s*([\\w.-]*)");
        int k = 0;
        for (String item : items) {
            Matcher m = regex.matcher(item);
            if (m.find()) {
                String group1 = m.group(1); // part before the last comma
                String group2 = m.group(2); // first token after the comma
                String group3 = m.group(3); // second token after the comma (may be empty)
                // A dot marks initials; if they come first, groups 2 and 3 are swapped on output.
                boolean initialsInGroup2 = group2.contains(".");
                boolean initialsInGroup3 = group3.contains(".");
                System.out.println(++k
                        + (!"".equals(group1) ? String.format("%15s", group1) : "")
                        + (!"".equals(group2) ? String.format("%15s", initialsInGroup2 ? group3 : group2) : "")
                        + (!"".equals(group3) ? String.format("%10s", initialsInGroup3 ? group3 : initialsInGroup2 ? group2 : group3) : ""));
            }
        }
    }
}
Output:
1           John          Smith        J.
2          David        Cameron
3     Adam-Steve        Johnson        M.
4        Antonio       Zang-Chi         K
5  Joan Philippe           Luis    Carlos
6      Dave, Jr.          Danny      Rock
7           Jake   Joan-Anberto
8          Annie         Selena       L.K
9           Anna           Zhei        P.
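For readers who would rather stay in R, the same expression and the same swap logic could be sketched roughly as follows. This is a translation under assumptions, not part of the original answer: it relies on `regexec()` accepting `perl = TRUE` (R 3.3.0 or later), and the names `full_names`, `parts` and `result` are made up for illustration.

```r
full_names <- c("John, Smith J.", "David, Cameron", "Adam-Steve, Johnson M.",
                "Antonio, Zang-Chi K", "Joan Philippe, Luis Carlos",
                "Dave, Jr., Danny Rock", "Jake, Joan-Anberto",
                "Annie, L.K Selena", "Anna, P. Zhei")

# Same expression as in the Java code; backslashes are doubled for the R string
pattern <- "([\\w\\s.,-]+)(?:[^,]*,\\s){1,}([\\w.-]+)\\s*([\\w.-]*)"

# One character vector per name: full match, then groups 1 to 3
m <- regmatches(full_names, regexec(pattern, full_names, perl = TRUE))
parts <- do.call(rbind, m)   # assumes every name matched, as in this sample

g2 <- parts[, 3]
g3 <- parts[, 4]
has_dot2 <- grepl(".", g2, fixed = TRUE)   # initials in group 2
has_dot3 <- grepl(".", g3, fixed = TRUE)   # initials in group 3

# Mirror the Java conditionals: if the initials come first, swap groups 2 and 3
result <- data.frame(
  LASTNAME            = parts[, 2],
  FIRSTNAME           = ifelse(has_dot2, g3, g2),
  MIDDLE_NAME_INITIAL = ifelse(has_dot3, g3, ifelse(has_dot2, g2, g3)),
  stringsAsFactors    = FALSE
)
print(result)
```

Note that the swap only triggers when the initials contain a dot, so an entry such as "Smith, K John" would not be reordered.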
Upvotes: 1
Reputation: 174348
Because your data is not in a fixed column-wise order, I think there is too much conditional logic to try to capture all this in a maintainable regex in R. I'm not even sure how you could tell which names are first names and which are middle names when initials are not used since the ordering is inconsistent.
However, based on the rules implied by how you have manually parsed the names, here is some code that can replicate these rules:
# Middle name or initials: tokens without lowercase letters if there are any,
# otherwise the last token (or "" when there is only one token)
extract_initials <- function(x)
{
  y <- lapply(strsplit(x, " "), function(z) z[nzchar(z)])
  sapply(y, function(z){
    if(length(z) == 1) return("")
    else if(!all(grepl("[a-z]", z)))
      return(paste(grep("[a-z]", z, invert = TRUE, value = TRUE), collapse = " "))
    else return(paste(z[length(z)], collapse = " "))
  })
}

# First name: tokens containing lowercase letters if some tokens are initials,
# otherwise everything except the last token
extract_first <- function(x)
{
  y <- lapply(strsplit(x, " "), function(z) z[nzchar(z)])
  sapply(y, function(z){
    if(length(z) == 1) return(z)
    else if(!all(grepl("[a-z]", z)))
      return(paste(grep("[a-z]", z, value = TRUE), collapse = " "))
    else return(paste(z[-length(z)], collapse = " "))
  })
}

# Split at the last comma only, so "Dave, Jr." stays together as a surname
split_name <- function(x)
{
  partlist <- strsplit(x, ",(?=[^,]*$)", perl = TRUE)
  surnames <- sapply(partlist, `[`, 1)
  forenames <- sapply(partlist, `[`, 2)
  data.frame(surname = surnames,
             first = extract_first(forenames),
             middle = extract_initials(forenames),
             stringsAsFactors = FALSE)
}
and it works as simply as this:
split_name(df$FULLNAME)
#>         surname        first middle
#> 1          John        Smith     J.
#> 2         David      Cameron
#> 3    Adam-Steve      Johnson     M.
#> 4       Antonio     Zang-Chi      K
#> 5 Joan Philippe         Luis Carlos
#> 6     Dave, Jr.        Danny   Rock
#> 7          Jake Joan-Anberto
#> 8         Annie       Selena    L.K
#> 9          Anna         Zhei     P.
Created on 2020-03-20 by the reprex package (v0.3.0)
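The two rules the functions above rely on can be looked at in isolation. The following is only a small sketch on three of the sample names; the variable names (`x`, `given`, `tokens`, `is_initial`) are made up for illustration, and `trimws()` is used here instead of the empty-string filtering in the answer.

```r
x <- c("John, Smith J.", "Dave, Jr., Danny Rock", "Annie, L.K Selena")

# Rule 1: split at the last comma only. The lookahead (?=[^,]*$) requires
# that no further comma follows, so "Dave, Jr." stays in one piece.
parts   <- strsplit(x, ",(?=[^,]*$)", perl = TRUE)
surname <- sapply(parts, `[`, 1)
given   <- trimws(sapply(parts, `[`, 2))

# Rule 2: a token without any lowercase letter is treated as initials.
tokens     <- strsplit(given, " +")
is_initial <- lapply(tokens, function(z) !grepl("[a-z]", z))

print(surname)
print(given)
print(is_initial)
```

When no token qualifies as initials, as in "Danny Rock", the answer's code falls back to treating the last token as the middle name, which is a guess rather than something the data can confirm.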
Upvotes: 1