Reputation: 3
I have strings of soccer games and I'm trying to break them down into individual parts in R. For example,
"Jun 0103:00 PMTottenham0 - 2Liverpool(0 - 1)" should return
"Jun 01", "3:00PM", "Tottenham", "0", "2", "Liverpool", "0", "1"
And
"May 0803:00 PMAjax2 - 3Tottenham(2 - 0)" should return
"May 08", "3:00PM", "Ajax", "2", "3", "Tottenham", "2", "0"
The goal is to get this into a data frame with the headers
c("Date", "Time", "Home team", "Home team score",
"Away team score", "Away team", "Home team HT score", "Away team HT score")
Upvotes: 0
Views: 67
Reputation: 593
The tidyverse way...
library(tidyverse)  # attaches dplyr, tibble and stringr
library(stringr)    # already attached by tidyverse; kept for clarity
strings <- tibble(full = c("Jun 0103:00 PMTottenham0 - 2Liverpool(0 - 1)",
                           "May 0803:00 PMAjax2 - 3Tottenham(2 - 0)"))
strings %>%
  mutate(date = str_extract(full, ".{6}"),                      # first six characters, e.g. "Jun 01"
         time = str_extract(full, "\\d{2}:\\d{2}\\s(AM|PM)"),
         team_home = str_extract(full, "(AM|PM)[[:alpha:]]+"),  # letters right after AM/PM
         team_home = str_remove(team_home, "^(AM|PM)"),
         score_home = str_extract(full, "\\d+\\s-"),
         score_away = str_extract(full, "-\\s\\d+"),
         team_away = str_extract(full, "\\d[[:alpha:]]+"),      # letters right after the away score
         team_away = str_remove(team_away, "\\d"),
         score_ht_home = str_extract(full, "\\(\\d+"),          # \\d+ also covers multi-digit scores
         score_ht_away = str_extract(full, "\\d+\\)")) %>%
  # keep only the digits in every score column, then convert to numeric
  mutate_at(vars(starts_with("score")), str_extract, pattern = "\\d+") %>%
  mutate_at(vars(starts_with("score")), as.numeric) %>%
  select(-full)
# A tibble: 2 x 8
  date   time     team_home score_home score_away team_away score_ht_home score_ht_away
  <chr>  <chr>    <chr>          <dbl>      <dbl> <chr>             <dbl>         <dbl>
1 Jun 01 03:00 PM Tottenham          0          2 Liverpool             0             1
2 May 08 03:00 PM Ajax               2          3 Tottenham             2             0
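The question asks for the time as "3:00PM" (no leading zero, no space), while the extraction above gives "03:00 PM". A minimal base R sketch of that conversion follows; the vector `time` and the name `time_short` are made up here for illustration, and the second value is an extra test case that is not in the question.

```r
# Example times in the extracted format; "11:30 AM" is an extra illustrative value.
time <- c("03:00 PM", "11:30 AM")

# Remove all whitespace, then drop a single leading zero if there is one.
time_short <- sub("^0", "", gsub("\\s", "", time))

print(time_short)
```

The same expression could be used inside the `mutate()` call on the `time` column.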
Upvotes: 0
Reputation: 32548
x = c("Jun 0103:00 PMTottenham0 - 2Liverpool(0 - 1)",
      "May 0803:00 PMAjax2 - 3Tottenham(2 - 0)")
read.csv(header = FALSE,
         text = gsub("(^.{6})(.{8})(\\D+)(\\d+)\\s-\\s(\\d+)(\\D+)\\((\\d+)\\s-\\s(\\d+).*",
                     "\\1,\\2,\\3,\\4,\\5,\\6,\\7,\\8",
                     x))
#      V1       V2        V3 V4 V5        V6 V7 V8
#1 Jun 01 03:00 PM Tottenham  0  2 Liverpool  0  1
#2 May 08 03:00 PM      Ajax  2  3 Tottenham  2  0
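The question also asks for specific headers rather than V1 to V8. One way to get them, as a sketch that reuses the same regex, is to assign the names after reading. The object names `headers`, `csv_lines` and `games` are chosen here for illustration only.

```r
x <- c("Jun 0103:00 PMTottenham0 - 2Liverpool(0 - 1)",
       "May 0803:00 PMAjax2 - 3Tottenham(2 - 0)")

# Column names requested in the question.
headers <- c("Date", "Time", "Home team", "Home team score",
             "Away team score", "Away team",
             "Home team HT score", "Away team HT score")

# Turn each string into one comma-separated line built from the eight capture groups.
csv_lines <- gsub("(^.{6})(.{8})(\\D+)(\\d+)\\s-\\s(\\d+)(\\D+)\\((\\d+)\\s-\\s(\\d+).*",
                  "\\1,\\2,\\3,\\4,\\5,\\6,\\7,\\8",
                  x)

games <- read.csv(text = csv_lines, header = FALSE, stringsAsFactors = FALSE)

# Assign the names afterwards so the spaces in them are kept as written.
names(games) <- headers

print(games)
```

Because the names contain spaces, columns have to be accessed with backticks or double brackets, for example games[["Home team"]].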
Upvotes: 2