Reputation: 520
Background
Below is my gamedata dataset in dput form -- it contains line scores for some MLB games.
structure(list(team = c("NYM", "NYM", "BOS", "NYM", "BOS"), linescore = c("010000000",
"(10)1140006x", "002200010", "00000(11)01x", "311200"), ondate = structure(c(18475,
18476, 18487, 18489, 18494), class = "Date")), class = "data.frame", row.names = c(NA,
-5L))
For example, here is one line score: "002200010".
Some line scores end with an 'x' and some have a two-digit value in parentheses, as in "00000(11)01x". Each digit not in parentheses indicates how many runs the team scored in the inning. If a team scores more than nine runs in an inning, the number is placed in parentheses, so in the line score "00000(11)01x" the team scored 11 runs in the sixth inning and did not come to bat in the bottom of the ninth (shown by the 'x').
Not every line score has nine innings. Some have more and some have as few as six.
What I need to do
First, what I need to do is to get how many runs a team scored in each inning, e.g., the first, second, third, and so on, and put each runs scored in a new column. I would prefer for the solution to use dplyr.
I have reviewed stackoverflow's suggested solutions, but found none that matched what I need. If there is one, I would appreciate it if you would share its URL.
I have tried to do it using this code:
gamedata %>%
select(ondate, team, linescore) %>%
mutate(inng1 = str_extract(linescore, "\\d|\\(\\d{2}\\)"))
Here is the output:
ondate team linescore inng1
2020-08-01 NYM 010000000 0
2020-08-02 NYM (10)1140006x (10)
2020-08-13 BOS 002200010 0
2020-08-15 NYM 00000(11)01x 0
2020-08-20 BOS 311200 3
Second, how can I remove the parentheses in the inng1 column for '10'?
The code below produced the error beneath it:
gamedata %>%
select(ondate, team, linescore) %>%
mutate(inng1 = str_extract(linescore, "\\d|\\(\\d{2}\\)"))
str_remove_all(inng1,"[()]")
This is the error message I got:
"Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), : object 'inng1' not found"
Third, I need to know how to extract the runs scored for each additional inning, starting with the second, putting each value in its own column, e.g, inng2, inng3, and so one.
At the end, I should have the output shown above (without parentheses for each double-digit inning) with a column for each inning, so there would be a column titled "inng1," "inng2," inng3", "inng4", and so on. The data in the inning columns needs to be numeric as later I will be summing them.
Upvotes: 3
Views: 252
Reputation: 26218
Something like this will do-
gsub
for some regex
transformationstringr::str_trim
and stringr::str_count()
(that is optional though)tidyr::separate
dplyr::mutate
.Steps-
x
from string linescore
(I mutate this into a new column, you may mutate existing column also)tidyr::separate
to separate string into a different column each.convert = TRUE
to convert each string into numeric.For regex transformation explanation check this
library(tidyverse)
df <- structure(list(team = c("NYM", "NYM", "BOS", "NYM", "BOS"), linescore = c("010000000",
"(10)1140006x", "002200010", "00000(11)01x", "311200"), ondate = structure(c(18475,
18476, 18487, 18489, 18494), class = "Date")), class = "data.frame", row.names = c(NA,
-5L))
df %>%
mutate(inn = gsub('x', '', linescore),
inn = str_trim(gsub("(.)(?![^(]*\\))", "\\1 ", inn, perl=TRUE)),
inn = gsub('\\(|\\)', '', inn),
innings_count = 1 + str_count(inn, ' ')) %>%
separate(inn, into = paste0('innings_', seq(max(.$innings_count))), sep = ' ', fill = 'right', convert = TRUE)
#> team linescore ondate innings_1 innings_2 innings_3 innings_4
#> 1 NYM 010000000 2020-08-01 0 1 0 0
#> 2 NYM (10)1140006x 2020-08-02 10 1 1 4
#> 3 BOS 002200010 2020-08-13 0 0 2 2
#> 4 NYM 00000(11)01x 2020-08-15 0 0 0 0
#> 5 BOS 311200 2020-08-20 3 1 1 2
#> innings_5 innings_6 innings_7 innings_8 innings_9 innings_count
#> 1 0 0 0 0 0 9
#> 2 0 0 0 6 NA 8
#> 3 0 0 0 1 0 9
#> 4 0 11 0 1 NA 8
#> 5 0 0 NA NA NA 6
Upvotes: 4
Reputation: 21908
Solution02
Here is another solution you can use for this question which is much more efficient than the first one and is predominantly based on purrr
family of functions:
library(dplyr)
library(purrr)
df %>%
bind_cols(
map(df %>% select(linescore), ~ strsplit(.x, "\\(|\\)")) %>%
flatten() %>%
map_dfr(~ map(.x, ~ if(nchar(.x) > 2) strsplit(.x, "")[[1]] else .x) %>%
reduce(~ c(.x, .y)) %>%
keep(~ nchar(.x) != 0) %>% t() %>%
as_tibble() %>%
set_names(~ paste0("inng", 1:length(.x)))) %>%
mutate(across(everything(), ~ replace(.x, .x == "x", NA_character_)),
count_inng = pmap_dbl(cur_data(), ~ sum(!is.na(c(...)))),
sums_inng = pmap_dbl(select(cur_data(), starts_with("inng")),
~ sum(as.numeric(c(...)), na.rm = TRUE)))
)
team linescore ondate inng1 inng2 inng3 inng4 inng5 inng6 inng7 inng8 inng9 count_inng
1 NYM 010000000 2020-08-01 0 1 0 0 0 0 0 0 0 9
2 NYM (10)1140006x 2020-08-02 10 1 1 4 0 0 0 6 <NA> 8
3 BOS 002200010 2020-08-13 0 0 2 2 0 0 0 1 0 9
4 NYM 00000(11)01x 2020-08-15 0 0 0 0 0 11 0 1 <NA> 8
5 BOS 311200 2020-08-20 3 1 1 2 0 0 <NA> <NA> <NA> 6
sums_inng
1 1
2 22
3 5
4 12
5 7
Solution01
I have made some modifications to my solution as it incorrectly replaced the double-digit number in the output vector and I think it has been fixed. I think this solution may help you. For this purpose I decided to write a custom function to detect two-digits numbers and also trim the output of your scores:
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
fn <- function(x) {
out <- c()
if(str_detect(x, "\\((\\d){2}\\)")) {
double <- str_replace_all(str_extract(x, "\\((\\d){2}\\)"), "[)()]", "")
ind <- str_locate(x, "\\(")
x <- str_remove(x, "\\((\\d){2}\\)")
out <- c(out, str_split(x, "")[[1]])
out[(ind[1, 1]+1):(length(out)+1)] <- out[(ind[1, 1]):length(out)]
out[ind] <- double
} else {
out <- c(out, str_split(x, "")[[1]])
}
if(any(grepl(")", out))) {
out <- out[-which(out == ")")]
}
out
}
# Test
fn("(10)1140006x)")
[1] "10" "1" "1" "4" "0" "0" "0" "6" "x"
Then we apply it on our data set in a row-wise operation:
df %>%
mutate(linescore = map(linescore, fn)) %>%
unnest_wider(linescore) %>%
rename_with(~ gsub("(\\.\\.\\.)(\\d)", paste0("inng", "\\2"), .), starts_with("...")) %>%
mutate(across(starts_with("inng"), ~ {replace(.x, .x == "x", NA)
as.numeric(.x)}),
inns_count = pmap_dbl(select(cur_data(), starts_with("inng")),
~ sum(!is.na(c(...)))),
inns_sums = pmap_dbl(select(cur_data(), starts_with("inng")),
~ sum(c(...), na.rm = TRUE)))
# A tibble: 5 x 13
team inng1 inng2 inng3 inng4 inng5 inng6 inng7 inng8 inng9 ondate inns_count inns_sums
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <dbl> <dbl>
1 NYM 0 1 0 0 0 0 0 0 0 2020-08-01 9 1
2 NYM 10 1 1 4 0 0 0 6 NA 2020-08-02 8 22
3 BOS 0 0 2 2 0 0 0 1 0 2020-08-13 9 5
4 NYM 0 0 0 0 0 11 0 1 NA 2020-08-15 8 12
5 BOS 3 1 1 2 0 0 NA NA NA 2020-08-20 6 7
Upvotes: 5