bcongelio
bcongelio

Reputation: 101

Writing function to help merge to different datasets

I am attempting to merge two different datasets: nflfastrpbp and routes.merging.

While both datasets have identifying factors for each game:

nflfastrpbp = game_id, old_game_id

routes.merging = GameID

... they are not matches.

Here is a look at the nflfastrpbp data:

A tibble: 48,514 x 8
   game_id        old_game_id  week home_team away_team game_date  pass_defense_1_player_id pass_defense_1_player_name
   <chr>          <chr>       <int> <chr>     <chr>     <chr>      <chr>                    <chr>                     
 1 2020_01_ARI_SF 2020091311      1 SF        ARI       2020-09-13 NA                       NA                        
 2 2020_01_ARI_SF 2020091311      1 SF        ARI       2020-09-13 NA                       NA                        
 3 2020_01_ARI_SF 2020091311      1 SF        ARI       2020-09-13 NA                       NA                        
 4 2020_01_ARI_SF 2020091311      1 SF        ARI       2020-09-13 NA                       NA                        
 5 2020_01_ARI_SF 2020091311      1 SF        ARI       2020-09-13 NA                       NA                        
 6 2020_01_ARI_SF 2020091311      1 SF        ARI       2020-09-13 NA                       NA                        
 7 2020_01_ARI_SF 2020091311      1 SF        ARI       2020-09-13 NA                       NA                        
 8 2020_01_ARI_SF 2020091311      1 SF        ARI       2020-09-13 NA                       NA                        
 9 2020_01_ARI_SF 2020091311      1 SF        ARI       2020-09-13 NA                       NA                        
10 2020_01_ARI_SF 2020091311      1 SF        ARI       2020-09-13 NA                       NA      

And here is a look at the routes.merging data:

# A tibble: 80,676 x 6
   EventID GameID Season  Week OffensiveTeam DefensiveTeam
     <int>  <int>  <int> <int> <chr>         <chr>        
 1      15   2793   2020     1 Texans        Chiefs       
 2      15   2793   2020     1 Texans        Chiefs       
 3      15   2793   2020     1 Texans        Chiefs       
 4      15   2793   2020     1 Texans        Chiefs       
 5      15   2793   2020     1 Texans        Chiefs       
 6      25   2793   2020     1 Texans        Chiefs       
 7      25   2793   2020     1 Texans        Chiefs       
 8      25   2793   2020     1 Texans        Chiefs       
 9      25   2793   2020     1 Texans        Chiefs       
10      45   2793   2020     1 Chiefs        Texans       
# ... with 80,666 more rows

What I am trying to do: I am attempting to get the game_id from the nflfastrpbp data onto the routes.merging data and, of course, matching it up with the correct games so that I can merge the two together (specifically to pull the pass_defense_player information from nflfastrpbp to routes.merging.)

I've been trying to write a function but cannot figure it out.

If it helps, here is reprex for each dataset (I will include the 2020_01_ARI_SF game from both for helping in matching).

nflfastrpbp reprex:

structure(list(game_id = c("2020_01_ARI_SF", "2020_01_ARI_SF", 
"2020_01_ARI_SF", "2020_01_ARI_SF", "2020_01_ARI_SF"), old_game_id = c("2020091311", 
"2020091311", "2020091311", "2020091311", "2020091311"), week = c(1L, 
1L, 1L, 1L, 1L), home_team = c("SF", "SF", "SF", "SF", "SF"), 
    away_team = c("ARI", "ARI", "ARI", "ARI", "ARI"), game_date = c("2020-09-13", 
    "2020-09-13", "2020-09-13", "2020-09-13", "2020-09-13"), 
    pass_defense_1_player_id = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), pass_defense_1_player_name = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    )), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

routes.merging reprex:

structure(list(EventID = c(30L, 30L, 30L, 30L, 45L), GameID = c(2805L, 
2805L, 2805L, 2805L, 2805L), Season = c(2020L, 2020L, 2020L, 
2020L, 2020L), Week = c(1L, 1L, 1L, 1L, 1L), OffensiveTeam = c("49ers", 
"49ers", "49ers", "49ers", "Cardinals"), DefensiveTeam = c("Cardinals", 
"Cardinals", "Cardinals", "Cardinals", "49ers")), row.names = c(NA, 
-5L), groups = structure(list(EventID = c(30L, 45L), GameID = c(2805L, 
2805L), .rows = structure(list(1:4, 5L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = 1:2, class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

I hope all that made sense.

EDIT - Expected Outcome: The expected outcome is the routes.merging DF with a new column, id.for.merging, that is the game_id from the nflfastrpbp DF ... again, matched up correctly by game.

 EventID GameID Season  Week OffensiveTeam DefensiveTeam id.for.merging
     <int>  <int>  <int> <int> <chr>         <chr>         <chr>         
 1      15   2793   2020     1 Texans        Chiefs        2020_01_HOU_KC
 2      15   2793   2020     1 Texans        Chiefs        2020_01_HOU_KC
 3      15   2793   2020     1 Texans        Chiefs        2020_01_HOU_KC
 4      15   2793   2020     1 Texans        Chiefs        2020_01_HOU_KC
 5      15   2793   2020     1 Texans        Chiefs        2020_01_HOU_KC
 6      25   2793   2020     1 Texans        Chiefs        2020_01_HOU_KC
 7      25   2793   2020     1 Texans        Chiefs        2020_01_HOU_KC
 8      25   2793   2020     1 Texans        Chiefs        2020_01_HOU_KC
 9      25   2793   2020     1 Texans        Chiefs        2020_01_HOU_KC
10      45   2793   2020     1 Chiefs        Texans        2020_01_HOU_KC 

EDIT #2: The GameID from nflfastrpbp and GameID from routes.merging DO NOT match. That is why I am here for help. As seen in the above Expected Outcome, I need the GameID from nflfastrpbp to be on the data for routes.merging so that I can merge all the data from nflfastrpbp onto the routes.merging DF.

I started to write a function that used paste and got as far as 2020_0 but couldn't figure out how to grab the week (which is 01 in the above example ... but will go all the way to 17 with the full data) and then the away_team followed by the home_team ... so 2020_01_HOU_KC

EDIT #3 There is not a column to match.

I am trying to CREATE that column by recreating the game_id column in nflfastpbp within the routes.merging DF so that I can merge the two together on that newly created column.

So, I started to write this function:

testing <- function(x) {
  add.column.nflfastrpbp.to.routes.merged <- paste("2020_0")
}

routes.merging$id.for.merging <- testing()

And, in the id.for.merging column in routes.merging you can see it is working:

 EventID GameID Season  Week OffensiveTeam DefensiveTeam id.for.merging
     <int>  <int>  <int> <int> <chr>         <chr>         <chr>         
 1      15   2793   2020     1 Texans        Chiefs        2020_0        
 2      15   2793   2020     1 Texans        Chiefs        2020_0        
 3      15   2793   2020     1 Texans        Chiefs        2020_0        
 4      15   2793   2020     1 Texans        Chiefs        2020_0        
 5      15   2793   2020     1 Texans        Chiefs        2020_0        
 6      25   2793   2020     1 Texans        Chiefs        2020_0        
 7      25   2793   2020     1 Texans        Chiefs        2020_0        
 8      25   2793   2020     1 Texans        Chiefs        2020_0        
 9      25   2793   2020     1 Texans        Chiefs        2020_0        
10      45   2793   2020     1 Chiefs        Texans        2020_0        
# ... with 80,666 more rows

What I cannot figure out is how to finish writing that function to take all the information and correctly match the game_id from nflfastrpbp for all the unique games.

So, taking:

testing <- function(x) {
  add.column.nflfastrpbp.to.routes.merged <- paste("2020_0")
}

... and finishing it so that it outputs:

2020_01_ARI_SF

or 

2020_07_GB_HOU

into the newly created id.for.merging column.

To be clear:

2020 = year (not included in the data)

01 & 07 = week (included)

GB_HOU = away_team, home_towm

Upvotes: 0

Views: 72

Answers (1)

LC-datascientist
LC-datascientist

Reputation: 2096

You don't need to write a separate function to create a new column. But if you do want to, you can do this:

testing <- function(df) {
    library(dplyr)
    with(df, # assumes `df` is a data structure like `routes.merging`
        paste0(
            "2020_", 
            sprintf("%02d", Week), 
            "_", 
            case_when( # away_team == "team_name" ~ "city"
                OffensiveTeam == "Texans" ~ "HOU", 
                OffensiveTeam == "Chiefs" ~ "KC",
                OffensiveTeam == "Cardinals" ~ "ARI",  
                OffensiveTeam == "49ers" ~ "SF", 
                # etc.
                OffensiveTeam == "Packers" ~ "GB" 
            ), 
            "_", 
            case_when( # home_team == "team_name" ~ "city"
                DefensiveTeam == "Texans" ~ "HOU", 
                DefensiveTeam == "Chiefs" ~ "KC", 
                DefensiveTeam == "Cardinals" ~ "ARI", 
                DefensiveTeam == "49ers" ~ "SF", 
                # etc.
                DefensiveTeam == "Packers" ~ "GB" 
            )
        )
    )
}

routes.merging$id.for.merging <- testing(routes.merging)

Otherwise, you can add the column directly like this:

library(dplyr)
routes.merging <- mutate(routes.merging, 
    id.for.merging = paste0(
        "2020_", 
        sprintf("%02d", Week), 
        "_", 
        case_when( # away_team == "team_name" ~ "city"
            OffensiveTeam == "Texans" ~ "HOU", 
            OffensiveTeam == "Chiefs" ~ "KC", 
            OffensiveTeam == "Cardinals" ~ "ARI", 
            OffensiveTeam == "49ers" ~ "SF", 
            # etc.
            OffensiveTeam == "Packers" ~ "GB" 
        ), 
        "_", 
        case_when( # home_team == "team_name" ~ "city"
            DefensiveTeam == "Texans" ~ "HOU", 
            DefensiveTeam == "Chiefs" ~ "KC", 
            DefensiveTeam == "Cardinals" ~ "ARI", 
            DefensiveTeam == "49ers" ~ "SF", 
            # etc.
            DefensiveTeam == "Packers" ~ "GB" 
        )
    )
)

sprintf("%02d", Week) makes any single digit (e.g., "1") into double digits (e.g., "01"), and double digits stay double digits.

case_when() is a function in dplyr R package. The function allows you to vectorize multiple ifelse() statements. You will need to add more lines in case_when() for a complete list of the NFL teams, of course.

The output from using your reprex data structure looks like this:

# A tibble: 5 x 7
# Groups:   EventID, GameID [2]
  EventID GameID Season  Week OffensiveTeam DefensiveTeam id.for.merging
    <int>  <int>  <int> <int> <chr>         <chr>         <chr>         
1      30   2805   2020     1 49ers         Cardinals     2020_01_SF_ARI
2      30   2805   2020     1 49ers         Cardinals     2020_01_SF_ARI
3      30   2805   2020     1 49ers         Cardinals     2020_01_SF_ARI
4      30   2805   2020     1 49ers         Cardinals     2020_01_SF_ARI
5      45   2805   2020     1 Cardinals     49ers         2020_01_ARI_SF

Finally, merging:

merged_data <- full_join(routes.merging, nflfastrpbp, by = c("id.for.merging" = "game_id"))

Run ?dplyr::join or ?merge to learn more about some other merge functions and options.

Upvotes: 1

Related Questions