michael giacomazza
michael giacomazza

Reputation: 13

How does df.groupby('A').agg('min') translate to featuretools?

Say I have this simple snippet of code. I will group, aggregate, and merge the dataframe:


Using Pandas:


Data

df = pd.DataFrame({'A': [1, 1, 2, 2],
               'B': [1, 2, 3, 4],
               'C': [0.3, 0.2, 1.2, -0.5]})
df:
    A   B   C
0   1   1   0.3
1   1   2   0.2
2   2   3   1.2
3   2   4   -0.5

Group and Aggregate

df_result = df.groupby('A').agg('min')
df_result.columns =  ['groupby_A(min_'+x+')' for x in df_result.columns]
df_result:
    groupby_A(min_B)    groupby_A(min_C)
A       
1   1                   0.2
2   3                   -0.5

Merge

df_new = pd.merge(df,df_result,on='A')
df_new
df_new:
    A   B   C       groupby_A(min_B)    groupby_A(min_C)
0   1   1   0.3     1                   0.2
1   1   2   0.2     1                   0.2
2   2   3   1.2     3                  -0.5
3   2   4   -0.5    3                  -0.5

An Attempt using featuretools:


# ---- Import the Module ----
import featuretools as ft

# ---- Make the Entity Set (the set of all tables) ----
es = ft.EntitySet()

# ---- Make the Entity (the table) ----
es.entity_from_dataframe(entity_id = 'df', 
                         dataframe = df)


# ---- Do the Deep Feature Synthesis (group, aggregate, and merge the features) ----
feature_matrix, feature_names = ft.dfs(entityset = es, 
                                       target_entity = 'df',
                                       trans_primitives = ['cum_min'])

feature_matrix
feature_matrix:
        A       B       C       CUM_MIN(A)  CUM_MIN(B)  CUM_MIN(C)
index                       
0       1       1       0.3     1           1           0.3
1       1       2       0.2     1           1           0.2
2       2       3       1.2     1           1           0.2
3       2       4       -0.5    1           1           -0.5

How does the operation with Pandas translate into featuretools (preferably without adding another table)?

My attempt with featuretools does not give the right output, but I believe the process that I used is somewhat correct.

Upvotes: 1

Views: 739

Answers (1)

Max Kanter
Max Kanter

Reputation: 2014

Here is the recommended way to do it in Featuretools. You do need to create another table to make it work exactly as you want.

import featuretools as ft
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': [0.3, 0.2, 1.2, -0.5]})

es = ft.EntitySet()

es.entity_from_dataframe(entity_id="example",
                          index="id",
                          make_index=True,
                          dataframe=df)

es.normalize_entity(new_entity_id="a_entity",
                    base_entity_id="example",
                    index="A")

fm, fl = ft.dfs(target_entity="example",
                entityset=es,
                agg_primitives=["min"])

fm

this returns

    A  B    C  a_entity.MIN(example.B)  a_entity.MIN(example.C)
id                                                             
0   1  1  0.3                        1                      0.2
1   1  2  0.2                        1                      0.2
2   2  3  1.2                        3                     -0.5
3   2  4 -0.5                        3                     -0.5

If you don't want to create an extra table you could try using the cum_min primitive which calculate the cumulative after grouping by A

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': [0.3, 0.2, 1.2, -0.5]})

es = ft.EntitySet()

es.entity_from_dataframe(entity_id="example",
                          index="id",
                          make_index=True,
                          variable_types={
                              "A": ft.variable_types.Id
                          },
                          dataframe=df,)

fm, fl = ft.dfs(target_entity="example",
                entityset=es,
                groupby_trans_primitives=["cum_min"])

fm

this returns

    B    C  A  CUM_MIN(C) by A  CUM_MIN(B) by A
id                                             
0   1  0.3  1              0.3              1.0
1   2  0.2  1              0.2              1.0
2   3  1.2  2              1.2              3.0
3   4 -0.5  2             -0.5              3.0

Upvotes: 2

Related Questions