lekan98

Reputation: 3

Extracting different parts of a string

I have strings describing soccer games and I'm trying to split them into their individual parts in R. For example,

"Jun 0103:00 PMTottenham0 - 2Liverpool(0 - 1)" should return

"Jun 01", "3:00PM", "Tottenham", "0", "2", "Liverpool", "0", "1"

And

"May 0803:00 PMAjax2 - 3Tottenham(2 - 0)" should return

"May 08", "3:00PM", "Ajax", "2", "3", "Tottenham", "2", "0"

The goal is to get this into a data frame with the headers

c("Date", "Time", "Home team", "Home team score", 
    "Away team score", "Away team", "Home team HT score", "Away team HT score")

Upvotes: 0

Views: 67

Answers (2)

knytt

Reputation: 593

The tidyverse way...

library(tidyverse)  # attaches dplyr, tibble and stringr

strings <- tibble(full = c("Jun 0103:00 PMTottenham0 - 2Liverpool(0 - 1)", 
                           "May 0803:00 PMAjax2 - 3Tottenham(2 - 0)"))

# Note: the team patterns assume single-word, letters-only team names.
strings %>% mutate(date = str_extract(full, "^.{6}"),                      # first six characters, e.g. "Jun 01"
                   time = str_extract(full, "\\d{2}:\\d{2}\\s(AM|PM)"),
                   team_home = str_extract(full, "(AM|PM)[[:alpha:]]+"),   # letters right after AM/PM
                   team_home = str_remove(team_home, "(AM|PM)"),
                   score_home = str_extract(full, "\\d+\\s-"),
                   score_away = str_extract(full, "-\\s\\d+"),
                   team_away = str_extract(full, "\\d[[:alpha:]]+"),       # letters right after a digit
                   team_away = str_remove(team_away, "\\d"),
                   score_ht_home = str_extract(full, "\\(\\d+"),           # \\d+ also handles two-digit scores
                   score_ht_away = str_extract(full, "\\d+\\)")) %>% 
  # keep only the digits of each score column and convert them to numbers
  mutate(across(starts_with("score"), ~ as.numeric(str_extract(.x, "\\d+")))) %>% 
  select(-full)
# A tibble: 2 x 8
  date   time     team_home score_home score_away team_away score_ht_home score_ht_away
  <chr>  <chr>    <chr>          <dbl>      <dbl> <chr>             <dbl>         <dbl>
1 Jun 01 03:00 PM Tottenham          0          2 Liverpool             0             1
2 May 08 03:00 PM Ajax               2          3 Tottenham             2             0
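
If every string follows exactly the layout of the two examples, the same split can probably be done in one step with tidyr::extract() and a single regular expression. This is only a sketch under that assumption: the pattern `layout` is my own guess at the format (6-character date, "hh:mm AM/PM", home team, "h - a", away team, "(h - a)"), and the column names are the headers from the question.

```r
library(tidyr)

strings <- data.frame(
  full = c("Jun 0103:00 PMTottenham0 - 2Liverpool(0 - 1)",
           "May 0803:00 PMAjax2 - 3Tottenham(2 - 0)"),
  stringsAsFactors = FALSE
)

# Assumed layout, one capture group per output column.
# \\D+ means team names may contain spaces but no digits.
layout <- "^(.{6})(\\d{2}:\\d{2} [AP]M)(\\D+)(\\d+) - (\\d+)(\\D+)\\((\\d+) - (\\d+)\\)$"

games_tidy <- tidyr::extract(
  strings, full,
  into = c("Date", "Time", "Home team", "Home team score",
           "Away team score", "Away team",
           "Home team HT score", "Away team HT score"),
  regex = layout,
  convert = TRUE  # turns the score columns into integers
)

games_tidy
```

Rows that do not match the pattern come back as NA, so a string in a different format shows up as a row of missing values rather than as a wrong split.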

Upvotes: 0

d.b

Reputation: 32548

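x <- c("Jun 0103:00 PMTottenham0 - 2Liverpool(0 - 1)", "May 0803:00 PMAjax2 - 3Tottenham(2 - 0)")
# Capture the eight parts with one regex, rewrite each string as a
# comma-separated line, then let read.csv() split it and convert the scores.
read.csv(header = FALSE,
         text = gsub("(^.{6})(.{8})(\\D+)(\\d+)\\s-\\s(\\d+)(\\D+)\\((\\d+)\\s-\\s(\\d+).*",
                     "\\1,\\2,\\3,\\4,\\5,\\6,\\7,\\8",
                     x))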
#      V1       V2        V3 V4 V5        V6 V7 V8
#1 Jun 01 03:00 PM Tottenham  0  2 Liverpool  0  1
#2 May 08 03:00 PM      Ajax  2  3 Tottenham  2  0
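
To get the headers from the question instead of V1 to V8, one option is to pull the capture groups out with regexec() and regmatches() and name the columns afterwards. This is a minimal base R sketch, assuming every string matches the pattern (a non-matching string would break the rbind() step); the names `pattern`, `headers` and `games` are only illustrative.

```r
x <- c("Jun 0103:00 PMTottenham0 - 2Liverpool(0 - 1)",
       "May 0803:00 PMAjax2 - 3Tottenham(2 - 0)")

# Same idea as the gsub() pattern: one capture group per field.
pattern <- "^(.{6})(.{8})(\\D+)(\\d+) - (\\d+)(\\D+)\\((\\d+) - (\\d+)\\)$"

headers <- c("Date", "Time", "Home team", "Home team score",
             "Away team score", "Away team",
             "Home team HT score", "Away team HT score")

# regmatches() returns the full match followed by the groups; drop the full match.
m <- regmatches(x, regexec(pattern, x))
parts <- do.call(rbind, lapply(m, function(v) v[-1]))

games <- as.data.frame(parts, stringsAsFactors = FALSE)
names(games) <- headers

# Convert the four score columns from character to integer.
score_cols <- grep("score", names(games))
games[score_cols] <- lapply(games[score_cols], as.integer)

games
```

Because the headers contain spaces, the columns have to be accessed with backticks or double brackets, e.g. games[["Home team"]].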

Upvotes: 2
