data_life

Reputation: 399

Split a string of names and transpose

I have a list of names (famous directors) in the format First (possible middle) Last, which I need to rearrange to Last, First (possible middle). I can't simply split all of these on the first space, or even the second, because some last names consist of two words and some names include a middle name and/or middle initial that should stay with the first name.

Here is the dput for the list I'm working with:

> dput(directors.names)
c("Frank Darabont,", "Francis Ford Coppola,", "Francis Ford Coppola,", 
"Christopher Nolan,", "Sidney Lumet,", "Steven Spielberg,", "Peter Jackson,", 
"Quentin Tarantino,", "Sergio Leone,", "Peter Jackson,", "David Fincher,", 
"Robert Zemeckis,", "Christopher Nolan,", "Peter Jackson,", "Irvin Kershner,", 
"Lana Wachowski,", "Martin Scorsese,", "Milos Forman,", "Akira Kurosawa,", 
"David Fincher,", "Jonathan Demme,", "Fernando Meirelles,", "Roberto Benigni,", 
"Frank Capra,", "Steven Spielberg,", "George Lucas,", "Christopher Nolan,", 
"Hayao Miyazaki,", "Frank Darabont,", "Bong Joon Ho,", "Luc Besson,", 
"Masaki Kobayashi,", "Roman Polanski,", "James Cameron,", "Robert Zemeckis,", 
"Bryan Singer,", "Alfred Hitchcock,", "Roger Allers,", "Charles Chaplin,", 
"Tony Kaye,", "Isao Takahata,", "Charles Chaplin,", "Damien Chazelle,", 
"Ridley Scott,", "Martin Scorsese,", "Olivier Nakache,", "Christopher Nolan,", 
"Michael Curtiz,", "Sergio Leone,", "Alfred Hitchcock,", "Giuseppe Tornatore,", 
"Ridley Scott,", "Francis Ford Coppola,", "Christopher Nolan,", 
"Steven Spielberg,", "Charles Chaplin,", "Quentin Tarantino,", 
"Florian Henckel von Donnersmarck,", "Stanley Kubrick,", "Billy Wilder,", 
"Andrew Stanton,", "Anthony Russo,", "Billy Wilder,", "Stanley Kubrick,", 
"Bob Persichetti,", "Stanley Kubrick,", "Hayao Miyazaki,", "Park Chan-Wook,", 
"Todd Phillips,", "Makoto Shinkai,", "Lee Unkrich,", "Christopher Nolan,", 
"James Cameron,", "Sergio Leone,", "Anthony Russo,", "Nadine Labaki,", 
"Wolfgang Petersen,", "Akira Kurosawa,", "Rajkumar Hirani,", 
"John Lasseter,", "Sam Mendes,", "Milos Forman,", "Mel Gibson,", 
"Quentin Tarantino,", "Thomas Kail,", "Gus Van Sant,", "Richard Marquand,", 
"Stanley Kubrick,", "Quentin Tarantino,", "Elem Klimov,", "Fritz Lang,", 
"Aamir Khan,", "Alfred Hitchcock,", "Orson Welles,", "Thomas Vinterberg,", 
"Darren Aronofsky,", "Stanley Donen,", "Alfred Hitchcock,", "Michel Gondry,", 
"Akira Kurosawa,", "Vittorio De Sica,", "David Lean,", "Charles Chaplin,", 
"Stanley Kubrick,", "Nitesh Tiwari,", "Billy Wilder,", "Denis Villeneuve,", 
"Florian Zeller,", "Fritz Lang,", "Billy Wilder,", "Stanley Kubrick,", 
"Martin Scorsese,", "Asghar Farhadi,", "George Roy Hill,", "Brian De Palma,", 
"Satyajit Ray,", "Guy Ritchie,", "Sam Mendes,", "Jean-Pierre Jeunet,", 
"Robert Mulligan,", "Lee Unkrich,", "Sergio Leone,", "Pete Docter,", 
"Steven Spielberg,", "Michael Mann,", "Curtis Hanson,", "T.J. Gnanavel,", 
"Akira Kurosawa,", "John McTiernan,", "Akira Kurosawa,", "Akira Kurosawa,", 
"Peter Farrelly,", "Oliver Hirschbiegel,", "Terry Gilliam,", 
"Joseph L. Mankiewicz,", "Billy Wilder,", "Christopher Nolan,", 
"Clint Eastwood,", "Majid Majidi,", "Hayao Miyazaki,", "Martin Scorsese,", 
"Stanley Kramer,", "John Sturges,", "Paul Thomas Anderson,", 
"Martin Scorsese,", "John Huston,", "Guillermo del Toro,", "Ron Howard,", 
"Juan José Campanella,", "Martin Scorsese,", "Akira Kurosawa,", 
"Roman Polanski,", "Hayao Miyazaki,", "Guy Ritchie,", "Martin Scorsese,", 
"Ethan Coen,", "Charles Chaplin,", "Alfred Hitchcock,", "John Carpenter,", 
"Ingmar Bergman,", "Martin McDonagh,", "Sergio Pablos,", "David Lynch,", 
"M. Night Shyamalan,", "Ingmar Bergman,", "Peter Weir,", "Carol Reed,", 
"Steven Spielberg,", "Denis Villeneuve,", "Bong Joon Ho,", "James McTeigue,", 
"Ridley Scott,", "Danny Boyle,", "Pete Docter,", "David Lean,", 
"Joel Coen,", "Gavin O'Connor,", "Andrew Stanton,", "Quentin Tarantino,", 
"Victor Fleming,", "Yasujirô Ozu,", "Elia Kazan,", "Cagan Irmak,", 
"Damián Szifron,", "Andrei Tarkovsky,", "Michael Cimino,", "Denis Villeneuve,", 
"Costa-Gavras,", "Wes Anderson,", "Buster Keaton,", "Clyde Bruckman,", 
"Clint Eastwood,", "Ingmar Bergman,", "Richard Linklater,", "Adam Elliot,", 
"Steven Spielberg,", "Frank Capra,", "Jim Sheridan,", "Stanley Kubrick,", 
"Lenny Abrahamson,", "David Fincher,", "Mel Gibson,", "Carl Theodor Dreyer,", 
"Sriram Raghavan,", "James Mangold,", "Steve McQueen,", "Ernst Lubitsch,", 
"Joel Coen,", "Peter Weir,", "Ingmar Bergman,", "Dean DeBlois,", 
"George Miller,", "William Wyler,", "David Yates,", "Clint Eastwood,", 
"Henri-Georges Clouzot,", "Park Chan-Wook,", "Rob Reiner,", "Sidney Lumet,", 
"James Mangold,", "Anurag Kashyap,", "Stuart Rosenberg,", "Lasse Hallström,", 
"Mathieu Kassovitz,", "François Truffaut,", "Naoko Yamada,", 
"Oliver Stone,", "Tom McCarthy,", "Pete Docter,", "Alfred Hitchcock,", 
"Terry Jones,", "Terry George,", "Kar-Wai Wong,", "Yavuz Turgul,", 
"Ron Howard,", "Sean Penn,", "John G. Avildsen,", "Alejandro G. Iñárritu,", 
"Hayao Miyazaki,", "Andrei Tarkovsky,", "Frank Capra,", "Richard Linklater,", 
"Ingmar Bergman,", "Hideaki Anno,", "Gillo Pontecorvo,", "Federico Fellini,", 
"Rob Reiner,", "Wim Wenders,", "Krzysztof Kieslowski,", "Ram Kumar,"
)

Some tricky examples: I would need to split "John G. Avildsen" after the "G.", but "Bong Joon Ho" after the first space, and "Florian Henckel von Donnersmarck" after the second space (just to point out a few).

I've added a comma to the end of all strings so that I can then transpose the strings and have it return Last Name, First (possible middle) format.

I went through my list and collected every word that needs to remain with the last name, intending to split on those first. However, it isn't splitting where I need it to; it just returns each string in its own element.

Here is what I have tried most recently:

directors.names <- paste0(directors.1, ",")
directors.names <- strsplit(directors.names, "[[:space:]]+('von'|'Ford'|'Joon'|'De'|'del'|'Van')[[:space:]]+", perl = TRUE)  

Once these are split and transposed correctly, the duplicates need to be removed so that the list can be sorted alphabetically by last name, with each row showing Last Name, First Name (MI or Middle Name).

Upvotes: 2

Views: 102

Answers (1)

GuedesBF

Reputation: 9878

We know the patterns to extract (a first word, a last word, and occasionally a two-word last name), so an extract approach may work better than a split approach: we do not know the number of words in each name, so it would be difficult to split on the nth whitespace.

We can define a pattern for common two-word last names, then insert this pattern with glue::glue inside str_extract_all.

In the following call to str_extract_all, we define 3 possible patterns to extract:

  • a first word ^\\w+
  • a two-word last name (({two_word_patterns})\\s+\\w+$)
  • a regular last name \\w+$

These three are joined with | as the separator (1), all within a single regex string (no quotes in between).

After extracting the names, we can reverse the order with rev(), and, finally, paste them back together with toString.

toString is specifically useful when we need to paste character elements with a ", " separator, like in this case.
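As a small illustration of these last two steps in isolation, here is what rev() and toString() do to the pieces extracted from a single name. The vector `parts` is a made-up stand-in for one element of the str_extract_all output, and only base R is used:

```r
# `parts` is a hypothetical example: the pieces extracted from "Bong Joon Ho"
parts <- c("Bong", "Joon Ho")

# rev() puts the last name first; toString() joins the pieces with ", "
# (equivalent to paste(rev(parts), collapse = ", "))
toString(rev(parts))
#> [1] "Joon Ho, Bong"
```

The complete pipeline below applies the same two steps to every name with map() and map_chr().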

library(glue)
library(stringr)
library(purrr)

directors<-c("Fernando Meireles", "Bong Joon Ho", "Florian Henckel von Donnersmarck")

two_word_patterns <- '(von)|(Ford)|(Joon)|(De)|(del)|(Van)'  # see note (1)

directors %>%
    str_extract_all(pattern = glue('^\\w+|(({two_word_patterns})\\s+\\w+$)|\\w+$'))%>%
    map(rev) %>%
    map_chr(toString)

[1] "Meireles, Fernando"        "Joon Ho, Bong"             "von Donnersmarck, Florian"
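Note that this extract approach keeps only the first word and the last name, so a middle name such as "Henckel" is dropped in the output above. If middle names and initials need to stay with the first name, as the question asks, one possible alternative is a single capture-and-swap with base R sub(). This is only a sketch and not part of the original approach: the helper name `swap_name` and the vector `particles` are made up for illustration, and it assumes the particle list covers every two-word last name in the data.

```r
# Hypothetical helper: move the last name to the front and keep everything
# before it (first name, middle names, initials) together.
particles <- c("von", "Ford", "Joon", "De", "del", "Van")

swap_name <- function(x, particles) {
  # Last name = optional particle + final token; \\S+ (any non-space run)
  # also accepts tokens such as "Chan-Wook", "O'Connor" or "G."
  last <- paste0("((?:(?:", paste(particles, collapse = "|"), ")\\s+)?\\S+)")
  pattern <- paste0("^(.*?)\\s+", last, "$")
  # Names without a space (e.g. "Costa-Gavras") do not match and come back unchanged
  sub(pattern, "\\2, \\1", x, perl = TRUE)
}

directors <- c("John G. Avildsen", "Bong Joon Ho",
               "Florian Henckel von Donnersmarck", "Park Chan-Wook")
swap_name(directors, particles)
#> [1] "Avildsen, John G."                 "Joon Ho, Bong"
#> [3] "von Donnersmarck, Florian Henckel" "Chan-Wook, Park"
```

The lazy `(.*?)` takes as little as possible, so the optional particle is claimed by the last-name group whenever it directly precedes the final word.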

(1) If we had a vector of two-word last names and wanted to construct the 'two_word_patterns' programmatically, we could use:

two_words_2<-c('von', 'Ford', 'Joon', 'De', 'del', 'Van')

two_words_2_pattern <- map_chr(two_words_2, ~glue('({.x})')) %>%
    paste(collapse = '|')

[1] "(von)|(Ford)|(Joon)|(De)|(del)|(Van)"
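The same string can also be built without purrr or glue, since paste0() is vectorised and accepts a collapse argument. A minimal base R sketch of that alternative, reusing the vector from above:

```r
two_words_2 <- c('von', 'Ford', 'Joon', 'De', 'del', 'Van')

# Wrap each word in parentheses, then join the groups with "|"
two_words_2_pattern <- paste0('(', two_words_2, ')', collapse = '|')
two_words_2_pattern
#> [1] "(von)|(Ford)|(Joon)|(De)|(del)|(Van)"
```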

EDIT

- The OP provided data with dput()

If we really must work with names that carry the added trailing comma (as in "Fernando Meirelles,"), we can start by removing the comma with trimws before the operation, then pipe the output of trimws into the same code as above. Here I used just a subset of the data, for clarity:

head(directors.names, 40) %>%  # directors.names is the vector from the question's dput()
    trimws(whitespace = ',') %>%
    str_extract_all(pattern = glue('^\\w+|(({(two_words_2_pattern)})\\s+\\w+$)|\\w+$')) %>%
    map(rev) %>%
    map_chr(toString)
 [1] "Darabont, Frank"       "Ford Coppola, Francis" "Ford Coppola, Francis" "Nolan, Christopher"    "Lumet, Sidney"         "Spielberg, Steven"    
 [7] "Jackson, Peter"        "Tarantino, Quentin"    "Leone, Sergio"         "Jackson, Peter"        "Fincher, David"        "Zemeckis, Robert"     
[13] "Nolan, Christopher"    "Jackson, Peter"        "Kershner, Irvin"       "Wachowski, Lana"       "Scorsese, Martin"      "Forman, Milos"        
[19] "Kurosawa, Akira"       "Fincher, David"        "Demme, Jonathan"       "Meirelles, Fernando"   "Benigni, Roberto"      "Capra, Frank"         
[25] "Spielberg, Steven"     "Lucas, George"         "Nolan, Christopher"    "Miyazaki, Hayao"       "Darabont, Frank"       "Joon Ho, Bong"        
[31] "Besson, Luc"           "Kobayashi, Masaki"     "Polanski, Roman"       "Cameron, James"        "Zemeckis, Robert"      "Singer, Bryan"        
[37] "Hitchcock, Alfred"     "Allers, Roger"         "Chaplin, Charles"      "Kaye, Tony"
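The question also asks for duplicates to be removed and for the result to be sorted by last name. Because each string now begins with the last name, base R unique() and sort() should be enough. A minimal sketch, where `reversed` is a hypothetical stand-in for the character vector returned by the pipeline above:

```r
# `reversed` stands in for the output of the pipeline above (hypothetical name)
reversed <- c("Nolan, Christopher", "Darabont, Frank", "Jackson, Peter",
              "Nolan, Christopher", "Darabont, Frank")

# Drop repeated directors, then sort; the sort is effectively by last name
# because the last name leads each string
sort(unique(reversed))
#> [1] "Darabont, Frank"    "Jackson, Peter"     "Nolan, Christopher"
```

sort() orders strings according to the current locale, so where lower-case particles such as "von" or "del" end up relative to capitalised names may differ between systems.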

Upvotes: 2
