Reputation: 385
The problem is to efficiently parse data of this format:
lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
into a dataframe of two columns; one for the position, and one for the player.
The names are baseball players, and each name is prefaced with their position, which is the exact set {C, P, P, OF, 3B, SS, 1B, OF, 2B, OF} in some order. That is, those exact positions always occur.
For example, "C James McCann" should turn into
data.frame(position = "C", player = "James McCann")
In reality, I have many hundreds of thousands of such strings, and I want to parse them efficiently. Here is my inefficient solution:
library(stringr)   # str_match_all(), str_split(), str_trim()
library(magrittr)  # %>%
data.frame(
  position = str_match_all(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]] %>% as.character() %>% str_trim(),
  player = str_split(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]][-1],
  stringsAsFactors = FALSE
)
This tidyverse solution is simple, but I suspect I can do much better. Does anyone have any ideas?
Upvotes: 4
Views: 101
Reputation: 42564
Here is a solution which converts lineup into a string in CSV format, which is then read by fread():
library(magrittr) # piping used to improve readability
lineup %>%
stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>%
data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C      James McCann
 2:        P        Robbie Ray
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich
The "trick" is to put a line break in front of each position code and a column separator after it, e.g., " C " becomes "\nC;".
lineup %>%
stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;")
returns
[1] "\nC;James McCann\nP;Robbie Ray\nP;Rafael Montero\nOF;Giancarlo Stanton\n3B;Derek Dietrich\nSS;Miguel Rojas\n1B;Tommy Joseph\nOF;Marcell Ozuna\n2B;C?sar Hern?ndez\nOF;Christian Yelich"
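For readers without stringr or data.table, the same trick can be sketched in base R. This is only an illustration under the same assumptions as above (position codes surrounded by whitespace, no ";" inside lineup); gsub() and read.table() stand in for str_replace_all() and fread(), and nothing is claimed about relative speed.

```r
# Base R sketch of the "turn the string into CSV text" trick.
lineup <- " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

# " C " becomes "\nC;": a line break before each position code, a separator after it
csv_text <- gsub("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\n\\1;", lineup, perl = TRUE)

# read.table() skips the leading blank line by default (blank.lines.skip = TRUE)
parsed <- read.table(
  text = csv_text, sep = ";", header = FALSE,
  col.names = c("position", "player"),
  colClasses = "character",       # keep codes such as "1B" as text
  quote = "", comment.char = ""   # treat quotes and "#" in names literally
)
print(parsed)
```

Setting quote = "" and comment.char = "" is a defensive choice so that apostrophes or "#" in a name are not interpreted by the reader.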
This approach does not make many assumptions about the names. It even works with names like James P. McCann or Robbie Ray, Jr.
lineup2 %>%
stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>%
data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P  Rafael D Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich
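There are three prerequisites which must be fulfilled:
1. Middle initials C and P must be completed by a dot to avoid confusion with the position codes.
2. ; must not be used elsewhere in lineup.
3. lineup must start with a whitespace character, because the pattern expects whitespace before every position code.
Condition 3 can be waived with an improved regular expression and condition 2 can be checked for: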
lineup3 %T>%
{stopifnot(!stringr::str_detect(., ";"))} %>%
stringr::str_replace_all("(^\\s?|\\s)(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\2;") %>%
data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich
# original
lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
# other use cases
lineup1 = "C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2 = " C James P. McCann P Robbie Ray, Jr P Rafael D Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2a = " C James P. McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2b = " C James McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup3 = "C James P. McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup4 = " C James P. McCann P Robbie Ray; Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
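Since the question mentions hundreds of thousands of strings, here is a rough base R sketch of how the same replacement could be applied to a whole character vector at once while remembering which lineup each row came from. The function name parse_lineups() and the lineup_id column are made up for illustration, the two inputs are shortened versions of lineup and lineup3, and the sketch has not been benchmarked.

```r
# Sketch: parse many lineup strings in one pass and keep track of their origin.
# parse_lineups() and lineup_id are illustrative names only.
parse_lineups <- function(lineups) {
  # the separator must not already occur in the data
  stopifnot(!any(grepl(";", lineups, fixed = TRUE)))
  # gsub() is vectorised over its input, so every string is marked up in one call
  marked <- gsub("(^\\s?|\\s)(C|P|OF|SS|1B|2B|3B)\\s", "\n\\2;", lineups, perl = TRUE)
  records <- strsplit(marked, "\n", fixed = TRUE)
  records <- lapply(records, function(x) x[nzchar(x)])  # drop the empty leading piece
  flat <- unlist(records)
  sep <- regexpr(";", flat, fixed = TRUE)               # first ";" in each record
  data.frame(
    lineup_id = rep(seq_along(records), lengths(records)),
    position = substr(flat, 1, sep - 1),
    player = substr(flat, sep + 1, nchar(flat)),
    stringsAsFactors = FALSE
  )
}

# shortened versions of lineup and lineup3
parsed_all <- parse_lineups(c(
  " C James McCann P Robbie Ray OF Giancarlo Stanton",
  "C James P. McCann P Robbie Ray, Jr 2B C?sar Hern?ndez"
))
print(parsed_all)
```

Splitting on the inserted line breaks instead of calling a file reader once per string is a design choice of this sketch; pasting all marked strings together and reading them with a single fread() call would be another option.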
Upvotes: 2
Reputation: 50708
Here is a stringr::str_split option, using a positive look-behind and look-ahead:
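library(stringr)
# position codes; duplicates are dropped because the alternation needs each code only once
pos <- unique(c("C", "P", "P", "OF", "3B", "SS", "1B", "OF", "2B", "OF"))
pat <- sprintf("(%s)", paste(pos, collapse = "|"))
# split on a space that follows a position code or precedes one
matrix(unlist(str_split(trimws(lineup), sprintf(
  "((?<=(%s))\\s|\\s(?=(%s)))", pat, pat))), ncol = 2, byrow = TRUE)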
# [,1] [,2]
#[1,] "C" "James McCann"
#[2,] "P" "Robbie Ray"
#[3,] "P" "Rafael Montero"
#[4,] "OF" "Giancarlo Stanton"
#[5,] "3B" "Derek Dietrich"
#[6,] "SS" "Miguel Rojas"
#[7,] "1B" "Tommy Joseph"
#[8,] "OF" "Marcell Ozuna"
#[9,] "2B" "C?sar Hern?ndez"
#[10,] "OF" "Christian Yelich"
I don't know how well this covers any edge cases. A more complex and representative sample string would be helpful for testing.
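One cheap way to spot edge cases is to use the guarantee from the question that every lineup contains exactly the same ten position codes. The helper below is a hypothetical sketch (has_expected_positions() is not part of any package); it compares the parsed codes with the expected ones while ignoring order.

```r
# Sketch: check that a parse produced exactly the expected position codes.
# has_expected_positions() is an illustrative helper, not part of any package.
expected <- c("C", "P", "P", "OF", "3B", "SS", "1B", "OF", "2B", "OF")

has_expected_positions <- function(position) {
  # compare as multisets: order is ignored, duplicates are counted
  identical(sort(position, method = "radix"), sort(expected, method = "radix"))
}

good <- c("C", "P", "P", "OF", "3B", "SS", "1B", "OF", "2B", "OF")
bad  <- c("C", "P", "P", "OF", "3B", "SS", "1B", "OF", "OF")  # the "2B" row was dropped

has_expected_positions(rev(good))  # TRUE
has_expected_positions(bad)        # FALSE
```

Applied to the first column of the matrix above, a FALSE result would flag a lineup whose split went wrong.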
Upvotes: 2
Reputation: 263411
You could make a single pattern that would get you both the position and the player name with stringi::stri_match_all_regex:
library(stringi)
results <- stri_match_all_regex(lineup,
  pattern = "(C|P|OF|3B|SS|1B|2B) ([A-Z][A-Za-z]+ [A-Z][A-Za-z]+)")
results
[[1]]
[,1] [,2] [,3]
[1,] "C James McCann" "C" "James McCann"
[2,] "P Robbie Ray" "P" "Robbie Ray"
[3,] "P Rafael Montero" "P" "Rafael Montero"
[4,] "OF Giancarlo Stanton" "OF" "Giancarlo Stanton"
[5,] "3B Derek Dietrich" "3B" "Derek Dietrich"
[6,] "SS Miguel Rojas" "SS" "Miguel Rojas"
[7,] "1B Tommy Joseph" "1B" "Tommy Joseph"
[8,] "OF Marcell Ozuna" "OF" "Marcell Ozuna"
[9,] "OF Christian Yelich" "OF" "Christian Yelich"
I made the pattern more restrictive than yours: it limits the one or two characters between spaces to the combinations that are actual baseball positions. The result is a list with one matrix per input string. You should probably post a more complex example to support the further processing that will be needed. To turn each matrix into a data frame (with the result of the call above stored in results), you will need something along the lines of:
lapply( results, function(x){ as.data.frame(x[ , -1]) })
[[1]]
V1 V2
1 C James McCann
2 P Robbie Ray
3 P Rafael Montero
4 OF Giancarlo Stanton
5 3B Derek Dietrich
6 SS Miguel Rojas
7 1B Tommy Joseph
8 OF Marcell Ozuna
9 OF Christian Yelich
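Note that only nine rows come back: "2B C?sar Hern?ndez" is skipped because "?" is not matched by [A-Za-z]. If there are going to be accented characters, hyphenated names, middle names, or initials, then the pattern will need to be more complex.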
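As one possible direction for a more tolerant pattern, the base R sketch below stops describing what a name looks like and instead takes everything up to the next position code. It is an assumption-laden illustration, not a tested replacement: it still requires initials such as C or P to carry a dot, and the variable names are arbitrary.

```r
# Sketch: match "code + everything up to the next code" instead of spelling out name shapes.
lineup2 <- " C James P. McCann P Robbie Ray, Jr P Rafael D Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

codes <- "(?:C|P|OF|SS|1B|2B|3B)"
# a position code, then the shortest run of text that is followed by
# the next position code (or by the end of the string)
pattern <- sprintf("(?<!\\S)%s\\s+.+?(?=\\s+%s\\s|$)", codes, codes)

hits <- regmatches(lineup2, gregexpr(pattern, lineup2, perl = TRUE))[[1]]
parsed <- data.frame(
  position = sub("\\s.*$", "", hits),    # text before the first whitespace
  player = sub("^\\S+\\s+", "", hits),   # everything after the position code
  stringsAsFactors = FALSE
)
print(parsed)
```

Because the name part is matched lazily with a look-ahead, punctuation and "?" inside names no longer cause rows to be dropped.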
Upvotes: 3