Sri Sreshtan
Sri Sreshtan

Reputation: 41

How to merge multiple variables and create a new data set?

https://www.kaggle.com/nowke9/ipldata ----- Contains the IPL Data.

This is exploratory study performed for the IPL data set. (link for the data attached above) After merging both the files with "id" and "match_id", I have created four more variables namely total_extras, total_runs_scored, total_fours_hit and total_sixes_hit. Now I wish to combine these newly created variables into one single data frame. When I assign these variables into one single variable namely batsman_aggregate and selecting only the required columns, I am getting an error message.

    library(tidyverse)
    deliveries_tbl <- read.csv("deliveries_edit.csv")
    matches_tbl <- read.csv("matches.csv")

    combined_matches_deliveries_tbl <- deliveries_tbl %>%
    left_join(matches_tbl, by = c("match_id" = "id"))

    # Add team score and team extra columns for each match, each inning.
    total_score_extras_combined <- combined_matches_deliveries_tbl%>%
    group_by(id, inning, date, batting_team, bowling_team, winner)%>%
    mutate(total_score = sum(total_runs, na.rm = TRUE))%>%
    mutate(total_extras = sum(extra_runs, na.rm = TRUE))%>%
    group_by(total_score, total_extras, id, inning, date, batting_team, bowling_team, winner)%>%
    select(id, inning, total_score, total_extras, date, batting_team, bowling_team, winner)%>%
    distinct(total_score, total_extras)%>%
    glimpse()%>%
    ungroup()


# Batsman Aggregate (Runs Balls, fours, six , Sr)
# Batsman score in each match
batsman_score_in_a_match <- combined_matches_deliveries_tbl %>%
    group_by(id, inning, batting_team, batsman)%>%
    mutate(total_batsman_runs = sum(batsman_runs, na.rm = TRUE))%>%
    distinct(total_batsman_runs)%>%
    glimpse()%>%
        ungroup()

# Number of deliveries played . 
balls_faced <- combined_matches_deliveries_tbl %>%
    filter(wide_runs == 0)%>%
    group_by(id, inning, batsman)%>%
    summarise(deliveries_played = n())%>%
    ungroup()

# Number of 4 and 6s by a batsman in each match.
fours_hit <- combined_matches_deliveries_tbl %>%
    filter(batsman_runs == 4)%>%
    group_by(id, inning, batsman)%>%
    summarise(fours_hit = n())%>%
    glimpse()%>%
    ungroup()

sixes_hit <- combined_matches_deliveries_tbl %>%
    filter(batsman_runs == 6)%>%
    group_by(id, inning, batsman)%>%
    summarise(sixes_hit = n())%>%
    glimpse()%>%
    ungroup()

batsman_aggregate <- c(batsman_score_in_a_match, balls_faced, fours_hit, sixes_hit)%>%
    select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit)

The error message is displayed as:-

Error: `select()` doesn't handle lists.

The required output is the data set created newly constructed variables.

Upvotes: 1

Views: 199

Answers (1)

Edward
Edward

Reputation: 18653

You'll have to join those four tables, not combine using c.

And the join type is left_join so that all batsman are included in the output. Those who didn't face any balls or hit any boundaries will have NA, but you can easily replace these with 0.

I've ignored the by since dplyr will assume you want c("id", "inning", "batsman"), the only 3 common columns in all four data sets.

batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
  left_join(fours_hit) %>%
  left_join(sixes_hit) %>%
  select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
  replace(is.na(.), 0)

# A tibble: 11,335 x 7
      id inning batsman       total_batsman_runs deliveries_played fours_hit sixes_hit
   <int>  <int> <fct>                      <int>             <dbl>     <dbl>     <dbl>
 1     1      1 DA Warner                     14                 8         2         1
 2     1      1 S Dhawan                      40                31         5         0
 3     1      1 MC Henriques                  52                37         3         2
 4     1      1 Yuvraj Singh                  62                27         7         3
 5     1      1 DJ Hooda                      16                12         0         1
 6     1      1 BCJ Cutting                   16                 6         0         2
 7     1      2 CH Gayle                      32                21         2         3
 8     1      2 Mandeep Singh                 24                16         5         0
 9     1      2 TM Head                       30                22         3         0
10     1      2 KM Jadhav                     31                16         4         1
# ... with 11,325 more rows

There are also 2 batsmen who didn't face any delivery:

batsman_aggregate %>% filter(deliveries_played==0)
# A tibble: 2 x 7
     id inning batsman        total_batsman_runs deliveries_played fours_hit sixes_hit
  <int>  <int> <fct>                       <int>             <dbl>     <dbl>     <dbl>
1   482      2 MK Pandey                       0                 0         0         0
2  7907      1 MJ McClenaghan                  2                 0         0         0

One of which apparently scored 2 runs! So I think the batsman_runs column has some errors. The game is here and clearly says that on the second last delivery of the first innings, 2 wides were scored, not runs to the batsman.

Upvotes: 1

Related Questions