Reputation: 13
Say I have this simple snippet of code. I will group, aggregate, and merge the dataframe:
df = pd.DataFrame({'A': [1, 1, 2, 2],
'B': [1, 2, 3, 4],
'C': [0.3, 0.2, 1.2, -0.5]})
df:
A B C
0 1 1 0.3
1 1 2 0.2
2 2 3 1.2
3 2 4 -0.5
df_result = df.groupby('A').agg('min')
df_result.columns = ['groupby_A(min_'+x+')' for x in df_result.columns]
df_result:
groupby_A(min_B) groupby_A(min_C)
A
1 1 0.2
2 3 -0.5
df_new = pd.merge(df,df_result,on='A')
df_new
df_new:
A B C groupby_A(min_B) groupby_A(min_C)
0 1 1 0.3 1 0.2
1 1 2 0.2 1 0.2
2 2 3 1.2 3 -0.5
3 2 4 -0.5 3 -0.5
# ---- Import the Module ----
import featuretools as ft
# ---- Make the Entity Set (the set of all tables) ----
es = ft.EntitySet()
# ---- Make the Entity (the table) ----
es.entity_from_dataframe(entity_id = 'df',
dataframe = df)
# ---- Do the Deep Feature Synthesis (group, aggregate, and merge the features) ----
feature_matrix, feature_names = ft.dfs(entityset = es,
target_entity = 'df',
trans_primitives = ['cum_min'])
feature_matrix
feature_matrix:
A B C CUM_MIN(A) CUM_MIN(B) CUM_MIN(C)
index
0 1 1 0.3 1 1 0.3
1 1 2 0.2 1 1 0.2
2 2 3 1.2 1 1 0.2
3 2 4 -0.5 1 1 -0.5
How does the operation with Pandas translate into featuretools (preferably without adding another table)?
My attempt with featuretools does not give the right output, but I believe the process that I used is somewhat correct.
Upvotes: 1
Views: 739
Reputation: 2014
Here is the recommended way to do it in Featuretools. You do need to create another table to make it work exactly as you want.
import featuretools as ft
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 2],
'B': [1, 2, 3, 4],
'C': [0.3, 0.2, 1.2, -0.5]})
es = ft.EntitySet()
es.entity_from_dataframe(entity_id="example",
index="id",
make_index=True,
dataframe=df)
es.normalize_entity(new_entity_id="a_entity",
base_entity_id="example",
index="A")
fm, fl = ft.dfs(target_entity="example",
entityset=es,
agg_primitives=["min"])
fm
this returns
A B C a_entity.MIN(example.B) a_entity.MIN(example.C)
id
0 1 1 0.3 1 0.2
1 1 2 0.2 1 0.2
2 2 3 1.2 3 -0.5
3 2 4 -0.5 3 -0.5
If you don't want to create an extra table you could try using the cum_min
primitive which calculate the cumulative after grouping by A
df = pd.DataFrame({'A': [1, 1, 2, 2],
'B': [1, 2, 3, 4],
'C': [0.3, 0.2, 1.2, -0.5]})
es = ft.EntitySet()
es.entity_from_dataframe(entity_id="example",
index="id",
make_index=True,
variable_types={
"A": ft.variable_types.Id
},
dataframe=df,)
fm, fl = ft.dfs(target_entity="example",
entityset=es,
groupby_trans_primitives=["cum_min"])
fm
this returns
B C A CUM_MIN(C) by A CUM_MIN(B) by A
id
0 1 0.3 1 0.3 1.0
1 2 0.2 1 0.2 1.0
2 3 1.2 2 1.2 3.0
3 4 -0.5 2 -0.5 3.0
Upvotes: 2