Need to extract individual characters from a string column using R

Question

Background

Below is my gamedata dataset in dput form -- it contains line scores for some MLB games.

structure(list(team = c("NYM", "NYM", "BOS", "NYM", "BOS"), linescore = c("010000000", 
"(10)1140006x", "002200010", "00000(11)01x", "311200"), ondate = structure(c(18475, 
18476, 18487, 18489, 18494), class = "Date")), class = "data.frame", row.names = c(NA, 
-5L))

For example, here is one line score: "002200010".

Some line scores end with an 'x' and some have a two-digit value in parentheses, as in "00000(11)01x". Each digit not in parentheses indicates how many runs the team scored in the inning. If a team scores more than nine runs in an inning, the number is placed in parentheses, so in the line score "00000(11)01x" the team scored 11 runs in the sixth inning and did not come to bat in the bottom of the ninth (shown by the 'x').

Not every line score has nine innings. Some have more and some have as few as six.

What I need to do

First, what I need to do is to get how many runs a team scored in each inning, e.g., the first, second, third, and so on, and put each runs scored in a new column. I would prefer for the solution to use dplyr.

I have reviewed stackoverflow's suggested solutions, but found none that matched what I need. If there is one, I would appreciate it if you would share its URL.

I have tried to do it using this code:

gamedata %>%
  select(ondate, team, linescore) %>%
  mutate(inng1 = str_extract(linescore, "\d|$\d{2}$"))

Here is the output:

ondate      team linescore    inng1
2020-08-01  NYM 010000000       0   
2020-08-02  NYM (10)1140006x  (10)  
2020-08-13  BOS 002200010       0   
2020-08-15  NYM 00000(11)01x    0   
2020-08-20  BOS 311200          3

Second, how can I remove the parentheses in the inng1 column for '10'?

The code below produced the error beneath it:

gamedata %>%
  select(ondate, team, linescore) %>%
  mutate(inng1 = str_extract(linescore, "\d|$\d{2}$"))
 str_remove_all(inng1,"[()]")

This is the error message I got:

"Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), : object 'inng1' not found"

Third, I need to know how to extract the runs scored for each additional inning, starting with the second, putting each value in its own column, e.g, inng2, inng3, and so one.

At the end, I should have the output shown above (without parentheses for each double-digit inning) with a column for each inning, so there would be a column titled "inng1," "inng2," inng3", "inng4", and so on. The data in the inning columns needs to be numeric as later I will be summing them.

AnilGoyal · Accepted Answer

Something like this will do-

using base R's gsub for some regex transformation
using stringr::str_trim and stringr::str_count() (that is optional though)
using tidyr::separate
along with dplyr::mutate.

Steps-

remove x from string linescore (I mutate this into a new column, you may mutate existing column also)
replace every character outside parenthesis with that character plus a space using gsub again with the help of regex
thereafter remove parenthesis strings
use tidyr::separate to separate string into a different column each.
use convert = TRUE to convert each string into numeric.

For regex transformation explanation check this

library(tidyverse)
df <- structure(list(team = c("NYM", "NYM", "BOS", "NYM", "BOS"), linescore = c("010000000", 
                                                                                "(10)1140006x", "002200010", "00000(11)01x", "311200"), ondate = structure(c(18475, 
                                                                                                                                                             18476, 18487, 18489, 18494), class = "Date")), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                -5L))

df %>%
  mutate(inn = gsub('x', '', linescore),
         inn = str_trim(gsub("(.)(?![^(]*\))", "\1 ", inn, perl=TRUE)),
         inn = gsub('$|$', '', inn),
         innings_count = 1 + str_count(inn, ' ')) %>%
  separate(inn, into = paste0('innings_', seq(max(.$innings_count))), sep = ' ', fill = 'right', convert = TRUE)
#>   team    linescore     ondate innings_1 innings_2 innings_3 innings_4
#> 1  NYM    010000000 2020-08-01         0         1         0         0
#> 2  NYM (10)1140006x 2020-08-02        10         1         1         4
#> 3  BOS    002200010 2020-08-13         0         0         2         2
#> 4  NYM 00000(11)01x 2020-08-15         0         0         0         0
#> 5  BOS       311200 2020-08-20         3         1         1         2
#>   innings_5 innings_6 innings_7 innings_8 innings_9 innings_count
#> 1         0         0         0         0         0             9
#> 2         0         0         0         6        NA             8
#> 3         0         0         0         1         0             9
#> 4         0        11         0         1        NA             8
#> 5         0         0        NA        NA        NA             6

Need to extract individual characters from a string column using R

Answers (2)

Related Questions