Joshua Oehmen
Joshua Oehmen

Reputation: 21

How to Transform a Soccer Match DataFrame to a Long Format with Separate Rows for Home and Away Teams in R

I have a DataFrame in R with the following columns:

Here is an example of my data:

  df <- data.frame(
  season = c("2015/2016", "2015/2016"),
  stage = c(1, 1),
  home_team_api_id = c(1, 2),
  away_team_api_id = c(2, 1),
  home_team_goal = c(3, 2),
  away_team_goal = c(1, 3),
  match_api_id = c(101, 102)
)

I want to transform this DataFrame into a long format where each match has two rows: one for the home team and one for the away team. The following columns should be included in the transformed DataFrame:

Desired output: For the example input data, the desired output would look like this:

     season stage match_api_id team_api_id opponent_team_api_id goals goals_conceded is_home
1  2015/2016     1           101           1                   2     3              1    TRUE
2  2015/2016     1           101           2                   1     1              3   FALSE
3  2015/2016     1           102           2                   1     2              3    TRUE
4  2015/2016     1           102           1                   2     3              1   FALSE

Here is what I've tried so far:

df_long <- df %>%
  pivot_longer(cols = c(home_team_api_id, away_team_api_id), 
               names_to = "team_type", 
               values_to = "team_api_id") %>%
  mutate(
    is_home = ifelse(team_type == "home_team_api_id", TRUE, FALSE),
    goals = ifelse(is_home, home_team_goal, away_team_goal),
    goals_conceded = ifelse(is_home, away_team_goal, home_team_goal)
  ) %>%
  select(match_api_id, season, stage, team_api_id, goals, goals_conceded, is_home)

# opponent_team_api_id basierend auf match_api_id anhängen
df_long <- df_long %>%
  left_join(df %>%
              select(match_api_id, home_team_api_id, away_team_api_id),
            by = "match_api_id") %>%
  mutate(
    opponent_team_api_id = ifelse(is_home, away_team_api_id, home_team_api_id)
  ) %>%
  select(-home_team_api_id, -away_team_api_id)

This is my result:

match_api_id    season  stage   team_api_id goals   goals_conceded  is_home
1   2015/2016   1   1   1   3   TRUE
2   2015/2016   1   2   1   3   FALSE
3   2015/2016   1   2   2   3   TRUE
4   2015/2016   1   1   3   2   FALSE

What makes it difficult and seperates it from this question, is that I want to apply pivot_longer two times at once. I want to longer the goals and the teamIDs

How can I achieve this transformation in R? Any help would be greatly appreciated!

Thank you!

Upvotes: -1

Views: 61

Answers (1)

Near Lin
Near Lin

Reputation: 109

I would pivot_longer into 4 rows and then pivot_wider back to 2 rows.

main <- df |>
  pivot_longer(cols = c(home_team_api_id, away_team_api_id, home_team_goal, away_team_goal)) |>
  separate_wider_delim(name, delim = "_team_", names = c("is_home", "var")) |>
  pivot_wider(names_from = var, values_from = value)

The result would be:

  season    stage match_api_id is_home  api_id  goal
  <chr>     <dbl>        <dbl>   <chr>   <dbl> <dbl>
1 2015/2016     1          101    home       1     3
2 2015/2016     1          101    away       2     1
3 2015/2016     1          102    home       2     2
4 2015/2016     1          102    away       1     3

If it is absolutely necessary to retain information about the opponent (which appears to be redundant), duplicate another dataset and merge it back.

sub <- main |>
  mutate(is_home = ifelse(is_home == "home", "away", "home")) |>
  rename_with(~ paste0("opponent_", .x), api_id:goal)

complete <- left_join(main, sub) |>
  mutate(is_home = ifelse(is_home == "home", TRUE, FALSE))

The result would be:

  season    stage match_api_id is_home api_id  goal opponent_api_id opponent_goal
  <chr>     <dbl>        <dbl> <lgl>    <dbl> <dbl>           <dbl>         <dbl>
1 2015/2016     1          101 TRUE         1     3               2             1
2 2015/2016     1          101 FALSE        2     1               1             3
3 2015/2016     1          102 TRUE         2     2               1             3
4 2015/2016     1          102 FALSE        1     3               2             2

I believe there are better solutions, such as simply mutate variables with the original dataset, or use group_by and summarise after step 1.

Upvotes: 0

Related Questions