ThanksABundle

Reputation: 385

R efficiency challenge: Splitting a long character vector

The problem is to efficiently parse data of this format:

lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

into a data frame with two columns: one for the position and one for the player.

The names are baseball players, and each name is preceded by the player's position. The positions are always exactly the multiset {C, P, P, OF, 3B, SS, 1B, OF, 2B, OF}, in some order.

For example, "C James McCann" should turn into

data.frame(position = "C", player = "James McCann")

In reality, I have many hundreds of thousands of such strings, and I want to parse them efficiently. Here is my inefficient solution:

library(stringr)  # also provides the %>% pipe

data.frame(
    position = str_match_all(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]] %>% as.character() %>% str_trim(),
    player = str_split(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]][-1],
    stringsAsFactors = FALSE
)

This tidyverse solution is simple, but I suspect I can do much better. Does anyone have any ideas?

Upvotes: 4

Views: 101

Answers (3)

Uwe

Reputation: 42564

Here is a solution that converts lineup into a CSV-formatted string, which is then read by fread():

library(magrittr)  # piping used to improve readability
lineup %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C      James McCann
 2:        P        Robbie Ray
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich

The "trick" is to put a line break in front of the position characters and a column separator after, e.g., " C " becomes "\nC;".

lineup %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;")

returns

[1] "\nC;James McCann\nP;Robbie Ray\nP;Rafael Montero\nOF;Giancarlo Stanton\n3B;Derek Dietrich\nSS;Miguel Rojas\n1B;Tommy Joseph\nOF;Marcell Ozuna\n2B;C?sar Hern?ndez\nOF;Christian Yelich"

This approach does not make many assumptions about the names. It even works with names like James P. McCann or Robbie Ray, Jr.

lineup2 %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P  Rafael D Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich

There are three prerequisites which must be fulfilled:

  1. The name part must not contain a bare initial that is also a position indicator; for example, the initials C and P must be followed by a dot (C., P.) to avoid confusion.
  2. The column separator ; must not be used elsewhere in lineup.
  3. The string must start with a leading space.
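
To see why the first prerequisite matters, here is a minimal sketch of the same replace-then-parse idea in base R. It swaps in gsub() and read.table() for stringr and fread(), so it is not the exact code of this answer; the helper name split_lineup and the two short test strings are made up for illustration.

```r
# Hypothetical helper: same "line break before, separator after" trick,
# written with base R only (gsub + read.table instead of stringr + fread).
split_lineup <- function(x) {
  csv <- gsub("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\n\\1;", x, perl = TRUE)
  read.table(text = csv, sep = ";", quote = "", comment.char = "",
             col.names = c("position", "player"),
             colClasses = "character", stringsAsFactors = FALSE)
}

# Initial followed by a dot: " P. " is not mistaken for a position.
good <- split_lineup(" C James P. McCann P Robbie Ray")

# Bare initial: " P " inside the name is treated as a new position,
# so the name is cut in two and an extra row appears.
bad <- split_lineup(" C James P McCann P Robbie Ray")

print(good)
print(bad)
```

With the dotted initial the result has two rows; with the bare initial it has three, and "James" and "McCann" end up in different rows.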

Condition 3 can be waived with an improved regular expression, and condition 2 can be checked for:

lineup3 %T>% 
  {stopifnot(!stringr::str_detect(., ";"))} %>% 
  stringr::str_replace_all("(^\\s?|\\s)(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\2;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich

Data

# original
lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

# other use cases
lineup1 = "C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2 = " C James P. McCann P Robbie Ray, Jr P Rafael D Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2a = " C James P. McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2b = " C James McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup3 = "C James P. McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup4 = " C James P. McCann P Robbie Ray; Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

Upvotes: 2

Maurits Evers

Reputation: 50708

Here is a stringr::str_split option, using a positive look-behind and a positive look-ahead:

pos <- c("C", "P", "P", "OF", "3B", "SS", "1B", "OF", "2B", "OF")
pat <- sprintf("(%s)", paste(pos, collapse = "|"))

library(stringr)
# split on a space that either follows a position or precedes one
matrix(unlist(str_split(trimws(lineup), sprintf(
    "((?<=(%s))\\s|\\s(?=(%s)))", pat, pat))), ncol = 2, byrow = TRUE)
#    [,1] [,2]
#[1,] "C"  "James McCann"
#[2,] "P"  "Robbie Ray"
#[3,] "P"  "Rafael Montero"
#[4,] "OF" "Giancarlo Stanton"
#[5,] "3B" "Derek Dietrich"
#[6,] "SS" "Miguel Rojas"
#[7,] "1B" "Tommy Joseph"
#[8,] "OF" "Marcell Ozuna"
#[9,] "2B" "C?sar Hern?ndez"
#[10,] "OF" "Christian Yelich"

I don't know how well this covers any edge cases. A more complex and representative sample string would be helpful for testing.

Upvotes: 2

IRTFM

Reputation: 263411

You could make a single pattern that would get you both the position and the player name with stringi::stri_match_all_regex:

library(stringi)
results <- stri_match_all_regex(lineup,
                   pattern = "(C|P|OF|3B|SS|1B|2B) ([A-Z][A-Za-z]+ [A-Z][A-Za-z]+)")
results
[[1]]
      [,1]                   [,2] [,3]               
 [1,] "C James McCann"       "C"  "James McCann"     
 [2,] "P Robbie Ray"         "P"  "Robbie Ray"       
 [3,] "P Rafael Montero"     "P"  "Rafael Montero"   
 [4,] "OF Giancarlo Stanton" "OF" "Giancarlo Stanton"
 [5,] "3B Derek Dietrich"    "3B" "Derek Dietrich"   
 [6,] "SS Miguel Rojas"      "SS" "Miguel Rojas"     
 [7,] "1B Tommy Joseph"      "1B" "Tommy Joseph"     
 [8,] "OF Marcell Ozuna"     "OF" "Marcell Ozuna"    
 [9,] "OF Christian Yelich"  "OF" "Christian Yelich" 

I made the pattern more restrictive than yours: it limits the one- or two-character token between spaces to the combinations that are actual baseball positions. The result is a list with one matrix per input string. You should probably post a more complex example to support the further processing that will be needed. You will need something along the lines of the following, where results holds the output of stri_match_all_regex():

lapply( results, function(x){ as.data.frame(x[ , -1]) })
[[1]]
  V1                V2
1  C      James McCann
2  P        Robbie Ray
3  P    Rafael Montero
4 OF Giancarlo Stanton
5 3B    Derek Dietrich
6 SS      Miguel Rojas
7 1B      Tommy Joseph
8 OF     Marcell Ozuna
9 OF  Christian Yelich

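Note that the 2B row (C?sar Hern?ndez) is missing from the output above, because [A-Za-z]+ does not match the ? characters (nor would it match accented letters). If there are going to be hyphenated names, middle names, or initials, then the pattern will need to be more complex as well.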
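
One possible way to loosen the name part is sketched below. It is a different technique from the fixed two-word pattern above: a lazy match that runs up to the next position token, written in base R (gregexpr() with perl = TRUE) rather than stringi. The function name parse_lineups, the lineup_id column, and the two sample strings are made up for illustration. A bare C or P initial inside a name would still be mistaken for a position.

```r
# Hypothetical helper: parse a character vector of lineup strings in one pass.
parse_lineups <- function(lineups) {
  positions <- "C|P|OF|3B|SS|1B|2B"
  # A position token, a space, then as few characters as possible up to the
  # next " <position> " token or the end of the string.
  pattern <- sprintf("\\b(?:%s) .+?(?= (?:%s) |$)", positions, positions)
  hits <- regmatches(lineups, gregexpr(pattern, lineups, perl = TRUE))
  flat <- unlist(hits)
  data.frame(
    lineup_id = rep(seq_along(hits), lengths(hits)),  # which input string
    position  = sub(" .*$", "", flat),                # text before first space
    player    = sub("^\\S+ ", "", flat),              # text after first space
    stringsAsFactors = FALSE
  )
}

lineups <- c(
  " C James P. McCann P Robbie Ray, Jr 2B C?sar Hern?ndez",
  "OF Christian Yelich SS Miguel Rojas"
)
parsed <- parse_lineups(lineups)
print(parsed)
```

Because the name is "everything up to the next position", the sketch makes no assumption about which characters a name contains, and it returns one long data frame instead of a list of matrices.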

Upvotes: 3
