Reputation: 11
I have a set of dataframes/entity set for rugby league/sports data: players, teams, venues, games, team_stats and player_stats
players: player_id, player_name
teams: team_id, team_name
games: game_id, venue_id
venues: venue_id, venue_coordinates
team_stats: game_id, team_id + relevant team statistics for that game
player_stats: game_id, team_id, player_id + relevant player statistics for that game
The end goal is to generate features for a given team in a given game based on aggregated historical player statistics. e.g. game_id, team_id and for example sum of the player mean of a given statistic
As a worked example if we had 2 players in a team who both had historical values of [1, 2] for a given statistic I would like that to show as sum of mean statistic being 3 for the team.
What would be the most efficient way to achieve this?
I have tried using group by trans primitives to no avail. I'm not sure if I need to add team_id into the players dataframe?
Upvotes: 1
Views: 65
Reputation: 386
If I understand your question correctly, what you are trying to do is aggregate player stats from the player_stats
table to the team_stats
table. One way to do this would be to create a relationship between those two tables and use various aggregation primitives.
That relationship can be defined by the combination of (team_id, game_id)
in the two tables. Unfortunately, Featuretools does not support defining relationships using multiple columns in one table, so you probably need to create a new column that can be used to define the relationship.
Here is one way you could do this:
import pandas as pd
import featuretools as ft
# Define team stats
team_stats = pd.DataFrame({
"id": [0, 1, 2, 3],
"team_id": [100, 200, 100, 200],
"game_id": [0, 0, 1, 1],
})
# Define player stats
player_stats = pd.DataFrame({
"id": [0, 1, 2, 3, 4, 5, 6, 7],
"team_id": [100, 100, 200, 200, 100, 100, 200, 200],
"game_id": [0, 0, 0, 0, 1, 1, 1, 1],
"player_id": [0, 1, 2, 3, 0, 1, 2, 3],
"goals": [1, 2, 0, 1, 2, 0, 1, 2],
"minutes_played": [30, 45, 10, 50, 10, 40, 25, 45],
})
# Create new index columns for use in relationship
team_stats["created_idx"] = team_stats["team_id"].astype("string") + "-" + team_stats["game_id"].astype("string")
player_stats["created_idx"] = player_stats["team_id"].astype("string") + "-" + player_stats["game_id"].astype("string")
# Drop these columns from the players table since we no longer need them in this example.
# This prevents Featuretools from generating features from these numeric columns
player_stats = player_stats.drop(columns=["team_id", "game_id"])
# Create the EntitySet and add dataframes
es = ft.EntitySet()
es.add_dataframe(dataframe=team_stats, dataframe_name="teams", index="created_idx")
es.add_dataframe(dataframe=player_stats, dataframe_name="players", index="id")
# Add the relationship
es.add_relationship("teams", "created_idx", "players", "created_idx")
# Run DFS using the `Sum` aggregation primitive
# Use ignore columns to prevent generation of features from player_id
fm, feautres = ft.dfs(entityset=es,
target_dataframe_name="teams",
agg_primitives=["sum"],
ignore_columns={"players": ["player_id"]})
This will generate a dataframe like this:
id team_id game_id SUM(players.goals) SUM(players.minutes_played) SUM(players.player_id)
created_idx
100-0 0 100 0 3.0 75.0 1.0
200-0 1 200 0 1.0 60.0 5.0
100-1 2 100 1 2.0 50.0 1.0
200-1 3 200 1 3.0 70.0 5.0
Upvotes: 1