Metsfan
Metsfan

Reputation: 520

Need to extract individual characters from a string column using R

Background

Below is my gamedata dataset in dput form -- it contains line scores for some MLB games.

structure(list(team = c("NYM", "NYM", "BOS", "NYM", "BOS"), linescore = c("010000000", 
"(10)1140006x", "002200010", "00000(11)01x", "311200"), ondate = structure(c(18475, 
18476, 18487, 18489, 18494), class = "Date")), class = "data.frame", row.names = c(NA, 
-5L))

For example, here is one line score: "002200010".

Some line scores end with an 'x' and some have a two-digit value in parentheses, as in "00000(11)01x". Each digit not in parentheses indicates how many runs the team scored in the inning. If a team scores more than nine runs in an inning, the number is placed in parentheses, so in the line score "00000(11)01x" the team scored 11 runs in the sixth inning and did not come to bat in the bottom of the ninth (shown by the 'x').

Not every line score has nine innings. Some have more and some have as few as six.

What I need to do

First, what I need to do is to get how many runs a team scored in each inning, e.g., the first, second, third, and so on, and put each runs scored in a new column. I would prefer for the solution to use dplyr.

I have reviewed stackoverflow's suggested solutions, but found none that matched what I need. If there is one, I would appreciate it if you would share its URL.

I have tried to do it using this code:

gamedata %>%
  select(ondate, team, linescore) %>%
  mutate(inng1 = str_extract(linescore, "\\d|\\(\\d{2}\\)"))

Here is the output:

ondate      team linescore    inng1
2020-08-01  NYM 010000000       0   
2020-08-02  NYM (10)1140006x  (10)  
2020-08-13  BOS 002200010       0   
2020-08-15  NYM 00000(11)01x    0   
2020-08-20  BOS 311200          3

Second, how can I remove the parentheses in the inng1 column for '10'?

The code below produced the error beneath it:

gamedata %>%
  select(ondate, team, linescore) %>%
  mutate(inng1 = str_extract(linescore, "\\d|\\(\\d{2}\\)"))
 str_remove_all(inng1,"[()]")

This is the error message I got:

"Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), : object 'inng1' not found"

Third, I need to know how to extract the runs scored for each additional inning, starting with the second, putting each value in its own column, e.g, inng2, inng3, and so one.

At the end, I should have the output shown above (without parentheses for each double-digit inning) with a column for each inning, so there would be a column titled "inng1," "inng2," inng3", "inng4", and so on. The data in the inning columns needs to be numeric as later I will be summing them.

Upvotes: 3

Views: 252

Answers (2)

AnilGoyal
AnilGoyal

Reputation: 26218

Something like this will do-

  • using base R's gsub for some regex transformation
  • using stringr::str_trim and stringr::str_count() (that is optional though)
  • using tidyr::separate
  • along with dplyr::mutate.

Steps-

  • remove x from string linescore (I mutate this into a new column, you may mutate existing column also)
  • replace every character outside parenthesis with that character plus a space using gsub again with the help of regex
  • thereafter remove parenthesis strings
  • use tidyr::separate to separate string into a different column each.
  • use convert = TRUE to convert each string into numeric.

For regex transformation explanation check this

library(tidyverse)
df <- structure(list(team = c("NYM", "NYM", "BOS", "NYM", "BOS"), linescore = c("010000000", 
                                                                                "(10)1140006x", "002200010", "00000(11)01x", "311200"), ondate = structure(c(18475, 
                                                                                                                                                             18476, 18487, 18489, 18494), class = "Date")), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                -5L))

df %>%
  mutate(inn = gsub('x', '', linescore),
         inn = str_trim(gsub("(.)(?![^(]*\\))", "\\1 ", inn, perl=TRUE)),
         inn = gsub('\\(|\\)', '', inn),
         innings_count = 1 + str_count(inn, ' ')) %>%
  separate(inn, into = paste0('innings_', seq(max(.$innings_count))), sep = ' ', fill = 'right', convert = TRUE)
#>   team    linescore     ondate innings_1 innings_2 innings_3 innings_4
#> 1  NYM    010000000 2020-08-01         0         1         0         0
#> 2  NYM (10)1140006x 2020-08-02        10         1         1         4
#> 3  BOS    002200010 2020-08-13         0         0         2         2
#> 4  NYM 00000(11)01x 2020-08-15         0         0         0         0
#> 5  BOS       311200 2020-08-20         3         1         1         2
#>   innings_5 innings_6 innings_7 innings_8 innings_9 innings_count
#> 1         0         0         0         0         0             9
#> 2         0         0         0         6        NA             8
#> 3         0         0         0         1         0             9
#> 4         0        11         0         1        NA             8
#> 5         0         0        NA        NA        NA             6

Upvotes: 4

Anoushiravan R
Anoushiravan R

Reputation: 21908

Solution02

Here is another solution you can use for this question which is much more efficient than the first one and is predominantly based on purrr family of functions:

library(dplyr)
library(purrr)

df %>%
  bind_cols(
    map(df %>% select(linescore), ~ strsplit(.x, "\\(|\\)")) %>%
      flatten() %>%
      map_dfr(~ map(.x, ~ if(nchar(.x) > 2) strsplit(.x, "")[[1]] else .x) %>%
                reduce(~ c(.x, .y)) %>%
                keep(~ nchar(.x) != 0) %>% t() %>%
                as_tibble() %>% 
                set_names(~ paste0("inng", 1:length(.x)))) %>%
      mutate(across(everything(), ~ replace(.x, .x == "x", NA_character_)), 
             count_inng = pmap_dbl(cur_data(), ~ sum(!is.na(c(...)))), 
             sums_inng = pmap_dbl(select(cur_data(), starts_with("inng")), 
                                  ~ sum(as.numeric(c(...)), na.rm = TRUE)))
  )

  team    linescore     ondate inng1 inng2 inng3 inng4 inng5 inng6 inng7 inng8 inng9 count_inng
1  NYM    010000000 2020-08-01     0     1     0     0     0     0     0     0     0          9
2  NYM (10)1140006x 2020-08-02    10     1     1     4     0     0     0     6  <NA>          8
3  BOS    002200010 2020-08-13     0     0     2     2     0     0     0     1     0          9
4  NYM 00000(11)01x 2020-08-15     0     0     0     0     0    11     0     1  <NA>          8
5  BOS       311200 2020-08-20     3     1     1     2     0     0  <NA>  <NA>  <NA>          6
  sums_inng
1         1
2        22
3         5
4        12
5         7

Solution01

I have made some modifications to my solution as it incorrectly replaced the double-digit number in the output vector and I think it has been fixed. I think this solution may help you. For this purpose I decided to write a custom function to detect two-digits numbers and also trim the output of your scores:

library(dplyr)
library(stringr)
library(tidyr)
library(purrr)

fn <- function(x) {
  out <- c()
  if(str_detect(x, "\\((\\d){2}\\)")) {
    double <- str_replace_all(str_extract(x, "\\((\\d){2}\\)"), "[)()]", "")
    ind <- str_locate(x, "\\(")
    x <- str_remove(x, "\\((\\d){2}\\)")
    out <- c(out, str_split(x, "")[[1]])
    out[(ind[1, 1]+1):(length(out)+1)] <- out[(ind[1, 1]):length(out)]
    out[ind] <- double
  } else {
    out <- c(out, str_split(x, "")[[1]])
  }
  if(any(grepl(")", out))) {
    out <- out[-which(out == ")")]
  }
  out
}

# Test
fn("(10)1140006x)")
[1] "10" "1"  "1"  "4"  "0"  "0"  "0"  "6"  "x" 

Then we apply it on our data set in a row-wise operation:

df %>%
  mutate(linescore = map(linescore, fn)) %>% 
  unnest_wider(linescore) %>%
  rename_with(~ gsub("(\\.\\.\\.)(\\d)", paste0("inng", "\\2"), .), starts_with("...")) %>%
  mutate(across(starts_with("inng"), ~ {replace(.x, .x == "x", NA)
    as.numeric(.x)}), 
    inns_count = pmap_dbl(select(cur_data(), starts_with("inng")), 
                          ~ sum(!is.na(c(...)))), 
    inns_sums = pmap_dbl(select(cur_data(), starts_with("inng")), 
                         ~ sum(c(...), na.rm = TRUE)))

# A tibble: 5 x 13
  team  inng1 inng2 inng3 inng4 inng5 inng6 inng7 inng8 inng9 ondate     inns_count inns_sums
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date>          <dbl>     <dbl>
1 NYM       0     1     0     0     0     0     0     0     0 2020-08-01          9         1
2 NYM      10     1     1     4     0     0     0     6    NA 2020-08-02          8        22
3 BOS       0     0     2     2     0     0     0     1     0 2020-08-13          9         5
4 NYM       0     0     0     0     0    11     0     1    NA 2020-08-15          8        12
5 BOS       3     1     1     2     0     0    NA    NA    NA 2020-08-20          6         7

Upvotes: 5

Related Questions