Reputation: 101
I am attempting to merge two different datasets: nflfastrpbp
and routes.merging
.
While both datasets have identifying factors for each game:
nflfastrpbp = game_id, old_game_id
routes.merging = GameID
... they are not matches.
Here is a look at the nflfastrpbp
data:
A tibble: 48,514 x 8
game_id old_game_id week home_team away_team game_date pass_defense_1_player_id pass_defense_1_player_name
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
1 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
2 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
3 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
4 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
5 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
6 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
7 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
8 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
9 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
10 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
And here is a look at the routes.merging
data:
# A tibble: 80,676 x 6
EventID GameID Season Week OffensiveTeam DefensiveTeam
<int> <int> <int> <int> <chr> <chr>
1 15 2793 2020 1 Texans Chiefs
2 15 2793 2020 1 Texans Chiefs
3 15 2793 2020 1 Texans Chiefs
4 15 2793 2020 1 Texans Chiefs
5 15 2793 2020 1 Texans Chiefs
6 25 2793 2020 1 Texans Chiefs
7 25 2793 2020 1 Texans Chiefs
8 25 2793 2020 1 Texans Chiefs
9 25 2793 2020 1 Texans Chiefs
10 45 2793 2020 1 Chiefs Texans
# ... with 80,666 more rows
What I am trying to do: I am attempting to get the game_id
from the nflfastrpbp
data onto the routes.merging
data and, of course, matching it up with the correct games so that I can merge the two together (specifically to pull the pass_defense_player
information from nflfastrpbp
to routes.merging
.)
I've been trying to write a function but cannot figure it out.
If it helps, here is reprex for each dataset (I will include the 2020_01_ARI_SF
game from both for helping in matching).
nflfastrpbp
reprex:
structure(list(game_id = c("2020_01_ARI_SF", "2020_01_ARI_SF",
"2020_01_ARI_SF", "2020_01_ARI_SF", "2020_01_ARI_SF"), old_game_id = c("2020091311",
"2020091311", "2020091311", "2020091311", "2020091311"), week = c(1L,
1L, 1L, 1L, 1L), home_team = c("SF", "SF", "SF", "SF", "SF"),
away_team = c("ARI", "ARI", "ARI", "ARI", "ARI"), game_date = c("2020-09-13",
"2020-09-13", "2020-09-13", "2020-09-13", "2020-09-13"),
pass_defense_1_player_id = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), pass_defense_1_player_name = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
routes.merging
reprex:
structure(list(EventID = c(30L, 30L, 30L, 30L, 45L), GameID = c(2805L,
2805L, 2805L, 2805L, 2805L), Season = c(2020L, 2020L, 2020L,
2020L, 2020L), Week = c(1L, 1L, 1L, 1L, 1L), OffensiveTeam = c("49ers",
"49ers", "49ers", "49ers", "Cardinals"), DefensiveTeam = c("Cardinals",
"Cardinals", "Cardinals", "Cardinals", "49ers")), row.names = c(NA,
-5L), groups = structure(list(EventID = c(30L, 45L), GameID = c(2805L,
2805L), .rows = structure(list(1:4, 5L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = 1:2, class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
I hope all that made sense.
EDIT - Expected Outcome: The expected outcome is the routes.merging
DF with a new column, id.for.merging
, that is the game_id
from the nflfastrpbp
DF ... again, matched up correctly by game.
EventID GameID Season Week OffensiveTeam DefensiveTeam id.for.merging
<int> <int> <int> <int> <chr> <chr> <chr>
1 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
2 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
3 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
4 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
5 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
6 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
7 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
8 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
9 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
10 45 2793 2020 1 Chiefs Texans 2020_01_HOU_KC
EDIT #2: The GameID from nflfastrpbp
and GameID from routes.merging
DO NOT match. That is why I am here for help. As seen in the above Expected Outcome, I need the GameID
from nflfastrpbp
to be on the data for routes.merging
so that I can merge all the data from nflfastrpbp
onto the routes.merging
DF.
I started to write a function
that used paste
and got as far as 2020_0
but couldn't figure out how to grab the week
(which is 01
in the above example ... but will go all the way to 17 with the full data) and then the away_team
followed by the home_team
... so 2020_01_HOU_KC
EDIT #3 There is not a column to match.
I am trying to CREATE that column by recreating the game_id
column in nflfastpbp
within the routes.merging
DF so that I can merge the two together on that newly created column.
So, I started to write this function:
testing <- function(x) {
add.column.nflfastrpbp.to.routes.merged <- paste("2020_0")
}
routes.merging$id.for.merging <- testing()
And, in the id.for.merging
column in routes.merging
you can see it is working:
EventID GameID Season Week OffensiveTeam DefensiveTeam id.for.merging
<int> <int> <int> <int> <chr> <chr> <chr>
1 15 2793 2020 1 Texans Chiefs 2020_0
2 15 2793 2020 1 Texans Chiefs 2020_0
3 15 2793 2020 1 Texans Chiefs 2020_0
4 15 2793 2020 1 Texans Chiefs 2020_0
5 15 2793 2020 1 Texans Chiefs 2020_0
6 25 2793 2020 1 Texans Chiefs 2020_0
7 25 2793 2020 1 Texans Chiefs 2020_0
8 25 2793 2020 1 Texans Chiefs 2020_0
9 25 2793 2020 1 Texans Chiefs 2020_0
10 45 2793 2020 1 Chiefs Texans 2020_0
# ... with 80,666 more rows
What I cannot figure out is how to finish writing that function to take all the information and correctly match the game_id
from nflfastrpbp
for all the unique games.
So, taking:
testing <- function(x) {
add.column.nflfastrpbp.to.routes.merged <- paste("2020_0")
}
... and finishing it so that it outputs:
2020_01_ARI_SF
or
2020_07_GB_HOU
into the newly created id.for.merging
column.
To be clear:
2020 = year (not included in the data)
01 & 07 = week (included)
GB_HOU = away_team, home_towm
Upvotes: 0
Views: 72
Reputation: 2096
You don't need to write a separate function to create a new column. But if you do want to, you can do this:
testing <- function(df) {
library(dplyr)
with(df, # assumes `df` is a data structure like `routes.merging`
paste0(
"2020_",
sprintf("%02d", Week),
"_",
case_when( # away_team == "team_name" ~ "city"
OffensiveTeam == "Texans" ~ "HOU",
OffensiveTeam == "Chiefs" ~ "KC",
OffensiveTeam == "Cardinals" ~ "ARI",
OffensiveTeam == "49ers" ~ "SF",
# etc.
OffensiveTeam == "Packers" ~ "GB"
),
"_",
case_when( # home_team == "team_name" ~ "city"
DefensiveTeam == "Texans" ~ "HOU",
DefensiveTeam == "Chiefs" ~ "KC",
DefensiveTeam == "Cardinals" ~ "ARI",
DefensiveTeam == "49ers" ~ "SF",
# etc.
DefensiveTeam == "Packers" ~ "GB"
)
)
)
}
routes.merging$id.for.merging <- testing(routes.merging)
Otherwise, you can add the column directly like this:
library(dplyr)
routes.merging <- mutate(routes.merging,
id.for.merging = paste0(
"2020_",
sprintf("%02d", Week),
"_",
case_when( # away_team == "team_name" ~ "city"
OffensiveTeam == "Texans" ~ "HOU",
OffensiveTeam == "Chiefs" ~ "KC",
OffensiveTeam == "Cardinals" ~ "ARI",
OffensiveTeam == "49ers" ~ "SF",
# etc.
OffensiveTeam == "Packers" ~ "GB"
),
"_",
case_when( # home_team == "team_name" ~ "city"
DefensiveTeam == "Texans" ~ "HOU",
DefensiveTeam == "Chiefs" ~ "KC",
DefensiveTeam == "Cardinals" ~ "ARI",
DefensiveTeam == "49ers" ~ "SF",
# etc.
DefensiveTeam == "Packers" ~ "GB"
)
)
)
sprintf("%02d", Week)
makes any single digit (e.g., "1") into double digits (e.g., "01"), and double digits stay double digits.
case_when()
is a function in dplyr
R package. The function allows you to vectorize multiple ifelse()
statements. You will need to add more lines in case_when()
for a complete list of the NFL teams, of course.
The output from using your reprex data structure looks like this:
# A tibble: 5 x 7
# Groups: EventID, GameID [2]
EventID GameID Season Week OffensiveTeam DefensiveTeam id.for.merging
<int> <int> <int> <int> <chr> <chr> <chr>
1 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
2 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
3 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
4 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
5 45 2805 2020 1 Cardinals 49ers 2020_01_ARI_SF
Finally, merging:
merged_data <- full_join(routes.merging, nflfastrpbp, by = c("id.for.merging" = "game_id"))
Run ?dplyr::join
or ?merge
to learn more about some other merge functions and options.
Upvotes: 1