apapa2234
apapa2234

Reputation: 11

Featuretools group by issue

I have a set of dataframes/entity set for rugby league/sports data: players, teams, venues, games, team_stats and player_stats

players: player_id, player_name

teams: team_id, team_name

games: game_id, venue_id

venues: venue_id, venue_coordinates

team_stats: game_id, team_id + relevant team statistics for that game

player_stats: game_id, team_id, player_id + relevant player statistics for that game

The end goal is to generate features for a given team in a given game based on aggregated historical player statistics. e.g. game_id, team_id and for example sum of the player mean of a given statistic

As a worked example if we had 2 players in a team who both had historical values of [1, 2] for a given statistic I would like that to show as sum of mean statistic being 3 for the team.

What would be the most efficient way to achieve this?

I have tried using group by trans primitives to no avail. I'm not sure if I need to add team_id into the players dataframe?

Upvotes: 1

Views: 65

Answers (1)

Nate Parsons
Nate Parsons

Reputation: 386

If I understand your question correctly, what you are trying to do is aggregate player stats from the player_stats table to the team_stats table. One way to do this would be to create a relationship between those two tables and use various aggregation primitives.

That relationship can be defined by the combination of (team_id, game_id) in the two tables. Unfortunately, Featuretools does not support defining relationships using multiple columns in one table, so you probably need to create a new column that can be used to define the relationship.

Here is one way you could do this:

import pandas as pd
import featuretools as ft

# Define team stats
team_stats = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "team_id": [100, 200, 100, 200],
    "game_id": [0, 0, 1, 1],
})

# Define player stats
player_stats = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5, 6, 7],
    "team_id": [100, 100, 200, 200, 100, 100, 200, 200],
    "game_id": [0, 0, 0, 0, 1, 1, 1, 1],
    "player_id": [0, 1, 2, 3, 0, 1, 2, 3],
    "goals": [1, 2, 0, 1, 2, 0, 1, 2],
    "minutes_played": [30, 45, 10, 50, 10, 40, 25, 45],
})

# Create new index columns for use in relationship
team_stats["created_idx"] = team_stats["team_id"].astype("string") + "-" + team_stats["game_id"].astype("string")
player_stats["created_idx"] = player_stats["team_id"].astype("string") + "-" + player_stats["game_id"].astype("string")

# Drop these columns from the players table since we no longer need them in this example.
# This prevents Featuretools from generating features from these numeric columns
player_stats = player_stats.drop(columns=["team_id", "game_id"])

# Create the EntitySet and add dataframes
es = ft.EntitySet()
es.add_dataframe(dataframe=team_stats, dataframe_name="teams", index="created_idx")
es.add_dataframe(dataframe=player_stats, dataframe_name="players", index="id")

# Add the relationship
es.add_relationship("teams", "created_idx", "players", "created_idx")

# Run DFS using the `Sum` aggregation primitive
# Use ignore columns to prevent generation of features from player_id
fm, feautres = ft.dfs(entityset=es,
                      target_dataframe_name="teams",
                      agg_primitives=["sum"],
                      ignore_columns={"players": ["player_id"]})

This will generate a dataframe like this:

             id  team_id  game_id  SUM(players.goals)  SUM(players.minutes_played)  SUM(players.player_id)
created_idx
100-0         0      100        0                 3.0                         75.0                     1.0
200-0         1      200        0                 1.0                         60.0                     5.0
100-1         2      100        1                 2.0                         50.0                     1.0
200-1         3      200        1                 3.0                         70.0                     5.0

Upvotes: 1

Related Questions