Mike V

Reputation: 1364

Regular expression to split a Fullname column into Lastname, Firstname, and Middlename (or initial)

I have a FULLNAME column that I want to split into three columns: LASTNAME, FIRSTNAME, and MIDDLE_NAME_INITIAL. The different cases are covered in the example below. I think my code is easier to follow than my description.

df <- data.frame(FULLNAME = c("John, Smith J.", 
                          "David, Cameron", 
                          "Adam-Steve, Johnson M.", 
                          "Antonio, Zang-Chi K", 
                          "Joan Philippe, Luis Carlos", 
                          "Dave, Jr., Danny Rock",
                          "Jake, Joan-Anberto",
                          "Annie, L.K Selena",
                          "Anna, P. Zhei"))

Expected output:

       LASTNAME    FIRSTNAME MIDDLE_NAME_INITIAL
1          John        Smith             J.
2         David      Cameron               
3    Adam-Steve      Johnson             M.
4       Antonio     Zang-Chi              K
5 Joan Philippe         Luis         Carlos
6     Dave, Jr.        Danny           Rock
7          Jake Joan-Anberto               
8         Annie       Selena            L.K
9          Anna         Zhei             P.

I have googled this and found this here. I tried several approaches, one of which was pattern = "(.+),\\s*(.+)\\s+(.+)", but it failed to produce the expected output. Any recommendations would be appreciated.

Upvotes: 1

Views: 435

Answers (3)

oriberu

Reputation: 1216

This requires PCRE-style regular expression support (lookaheads and lookbehinds).

/
^                               # start at the beginning of the string
(
  \w+                           # first name
  (?:[- ]\w+)*                  # optional second part of first name
  (?:,(?![^,]*$)\s[\w.]+)?      # optional comma-separated addendum to 1st name
)
,\s                             # delimiting comma and space
(?=                             # assert existence of last name
  .*?                           # bridge gap to last name (pre-initials)
  (\w{2,}(?:-\w{2,})*)          # (optionally multi-part) last name
)
(?=                             # assert existence of optional initials
  (?:.*?\b(\w\.\w\b|\w\b\.?|(?<!-)\w+$))?  # optional initials or middle name
)
/x                              # flag: enable free-spacing mode for expression

See demo.

I have no idea about R; this is just an example of how to capture the different name parts, so far as possible.

Edit: updated the expression to treat additional name parts like middle name initials.

Upvotes: 2

Ilya Lysenko

Reputation: 1892

Try this expression:

([\w\s.,-]+)(?:[^,]*,\s){1,}([\w.-]+)\s*([\w.-]*)

Here you can see how it works: https://regexr.com/50oef

I don't know R, so here is an example in Java:

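import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> items = Arrays.asList(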
        "John, Smith J.",
        "David, Cameron",
        "Adam-Steve, Johnson M.",
        "Antonio, Zang-Chi K",
        "Joan Philippe, Luis Carlos",
        "Dave, Jr., Danny Rock",
        "Jake, Joan-Anberto",
        "Annie, L.K Selena",
        "Anna, P. Zhei");

Pattern regex = Pattern.compile("([\\w\\s.,-]+)(?:[^,]*,\\s){1,}([\\w.-]+)\\s*([\\w.-]*)");

int k = 0;
for (String item : items) {
    Matcher m = regex.matcher(item);

    if (m.find()) {
        String group1 = m.group(1);
        String group2 = m.group(2);
        String group3 = m.group(3);

        boolean initialsInGroup2 = group2.contains(".");
        boolean initialsInGroup3 = group3.contains(".");

        System.out.println(++k
                + (!"".equals(group1) ? String.format("%15s", group1) : "")
                + (!"".equals(group2) ? String.format("%15s", initialsInGroup2 ? group3 : group2) : "")
                + (!"".equals(group3) ? String.format("%10s", initialsInGroup3 ? group3 : initialsInGroup2 ? group2 : group3) : ""));
    }
}

Output:

1           John          Smith        J.
2          David        Cameron
3     Adam-Steve        Johnson        M.
4        Antonio       Zang-Chi         K
5  Joan Philippe           Luis    Carlos
6      Dave, Jr.          Danny      Rock
7           Jake   Joan-Anberto
8          Annie         Selena       L.K
9           Anna           Zhei        P.
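
Since the question is about R, here is a rough R counterpart to the Java above. It is only a sketch: it assumes an R version where regexec() accepts perl = TRUE, the helper name split_by_pattern is made up for illustration, and the swap rule is a simplified version of the Java logic (group 2 is treated as the initials only when it contains a dot and group 3 does not).

```r
# Sketch only: the same pattern applied in R through PCRE (perl = TRUE).
pattern <- "([\\w\\s.,-]+)(?:[^,]*,\\s){1,}([\\w.-]+)\\s*([\\w.-]*)"

# split_by_pattern is a hypothetical helper name, not part of the answer above.
split_by_pattern <- function(x) {
  # Each list element holds the full match followed by the three groups
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Keep groups 1 to 3; a non-matching string yields three NA values
  g <- t(vapply(m, function(p) p[2:4], character(3)))
  # When group 2 holds the initials (it contains a dot), swap it with group 3
  swap <- grepl(".", g[, 2], fixed = TRUE) & !grepl(".", g[, 3], fixed = TRUE)
  data.frame(LASTNAME = g[, 1],
             FIRSTNAME = ifelse(swap, g[, 3], g[, 2]),
             MIDDLE_NAME_INITIAL = ifelse(swap, g[, 2], g[, 3]),
             stringsAsFactors = FALSE)
}

res <- split_by_pattern(c("John, Smith J.",
                          "Dave, Jr., Danny Rock",
                          "Annie, L.K Selena"))
print(res)
```

The greedy first group backtracks to the last comma, which is why "Dave, Jr." stays together in LASTNAME. Initials written without a dot in the second position would not be swapped by this rule.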

Upvotes: 1

Allan Cameron

Reputation: 174348

Because your data is not in a fixed column-wise order, I think there is too much conditional logic to try to capture all this in a maintainable regex in R. I'm not even sure how you could tell which names are first names and which are middle names when initials are not used since the ordering is inconsistent.

However, based on the rules implied by how you have manually parsed the names, here is some code that can replicate these rules:

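# Return the middle name or initials from each forename string
extract_initials <- function(x)
{
  # Split on spaces and drop empty strings left by leading or double spaces
  y <- lapply(strsplit(x, " "), function(z) z[nzchar(z)])
  sapply(y, function(z){
    # A single token is the first name, so there is no middle part
    if(length(z) == 1) return("")
    # Tokens with no lower-case letter (e.g. "J.", "K", "L.K") are initials
    else if(!all(grepl("[a-z]", z)))
      return(paste(grep("[a-z]", z, invert = TRUE, value = TRUE), collapse = " "))
    # Otherwise the last token is taken as the middle name
    else return(paste(z[length(z)], collapse = " "))
  })
}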

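# Return the first name from each forename string
extract_first <- function(x)
{
  # Split on spaces and drop empty strings left by leading or double spaces
  y <- lapply(strsplit(x, " "), function(z) z[nzchar(z)])
  sapply(y, function(z){
    # A single token is the first name
    if(length(z) == 1) return(z)
    # If initials are present, the first name is every token that is not an initial
    else if(!all(grepl("[a-z]", z)))
      return(paste(grep("[a-z]", z, value = TRUE), collapse = " "))
    # Otherwise everything except the last token is the first name
    else return(paste(z[-length(z)], collapse = " "))
  })
}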

split_name <- function(x)
{
  partlist <- strsplit(x, ",(?=[^,]*$)", perl = TRUE)
  surnames <- sapply(partlist, `[`, 1)
  forenames <- sapply(partlist, `[`, 2)
  data.frame(surname = surnames, 
             first = extract_first(forenames), 
             middle = extract_initials(forenames),
             stringsAsFactors = FALSE)
}

and it works as simply as this:

split_name(df$FULLNAME)
#>         surname        first middle
#> 1          John        Smith     J.
#> 2         David      Cameron       
#> 3    Adam-Steve      Johnson     M.
#> 4       Antonio     Zang-Chi      K
#> 5 Joan Philippe         Luis Carlos
#> 6     Dave, Jr.        Danny   Rock
#> 7          Jake Joan-Anberto       
#> 8         Annie       Selena    L.K
#> 9          Anna         Zhei     P.

Created on 2020-03-20 by the reprex package (v0.3.0)
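
To make the two rules in split_name easier to see in isolation, here is a small sketch of the last-comma split and the initials test on two of the sample names. The variable names are only illustrative, and trimws() is used here for brevity where the functions above filter out empty tokens instead.

```r
x <- c("Dave, Jr., Danny Rock", "Annie, L.K Selena")

# The lookahead (?=[^,]*$) lets a comma match only when no further comma
# follows, so each string is split once, at its last comma.
parts <- strsplit(x, ",(?=[^,]*$)", perl = TRUE)
surname  <- sapply(parts, `[`, 1)          # "Dave, Jr." "Annie"
forename <- trimws(sapply(parts, `[`, 2))  # "Danny Rock" "L.K Selena"

# Tokens with no lower-case letter are treated as initials.
tokens <- strsplit(forename[2], " ")[[1]]  # "L.K" "Selena"
is_initial <- !grepl("[a-z]", tokens)      # TRUE FALSE

print(surname)
print(forename)
print(is_initial)
```

One consequence of the lower-case test is that a name written entirely in capitals would be classified as an initial, so the rule only fits data shaped like the sample.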

Upvotes: 1
