I'm trying to use featuretools to calculate time-series functions. Specifically, I'd like to subtract previous(x) from current(x) within each group key (user_id), but I'm having trouble adding this kind of relationship to the entity set.
import pandas as pd
import featuretools as ft

df = pd.DataFrame({
    "user_id": [i % 2 for i in range(0, 6)],
    'x': range(0, 6),
    'time': pd.to_datetime(['2014-1-1 04:00', '2014-1-1 05:00',
        '2014-1-1 06:00', '2014-1-1 08:00', '2014-1-1 10:00', '2014-1-1 12:00'])
})
print(df.to_string())
   user_id  x                time
0        0  0 2014-01-01 04:00:00
1        1  1 2014-01-01 05:00:00
2        0  2 2014-01-01 06:00:00
3        1  3 2014-01-01 08:00:00
4        0  4 2014-01-01 10:00:00
5        1  5 2014-01-01 12:00:00
es = ft.EntitySet(id='test')
es.entity_from_dataframe(entity_id='data', dataframe=df,
    variable_types={
        'user_id': ft.variable_types.Categorical,
        'x': ft.variable_types.Numeric,
        'time': ft.variable_types.Datetime
    },
    make_index=True, index='index',
    time_index='time'
)
I then try to invoke dfs, but I can't get the relationship right...
fm, fl = ft.dfs(
    target_entity="data",
    entityset=es,
    trans_primitives=["diff"]
)
print(fm.to_string())
       user_id  x  DIFF(x)
index
0            0  0      NaN
1            1  1      1.0
2            0  2      1.0
3            1  3      1.0
4            0  4      1.0
5            1  5      1.0
But what I actually want is the difference by user, that is, the change from the previous value for the same user:
       user_id  x  DIFF(x)
index
0            0  0      NaN
1            1  1      NaN
2            0  2      2.0
3            1  3      2.0
4            0  4      2.0
5            1  5      2.0
How do I get this kind of relationship in featuretools? I've tried several tutorials, but to no avail. I'm stumped.
Thanks!
Upvotes: 2
Views: 220
Thanks for the question. You can get the expected output by normalizing an entity for users and applying a group by transform primitive. I'll go through a quick example using this data.
user_id  x                 time
      0  0  2014-01-01 04:00:00
      1  1  2014-01-01 05:00:00
      0  2  2014-01-01 06:00:00
      1  3  2014-01-01 08:00:00
      0  4  2014-01-01 10:00:00
      1  5  2014-01-01 12:00:00
First, create the entity set and normalize an entity for the users.
es = ft.EntitySet(id='test')
es.entity_from_dataframe(
    dataframe=df,
    entity_id='data',
    make_index=True,
    index='index',
    time_index='time',
)
es.normalize_entity(
    base_entity_id='data',
    new_entity_id='users',
    index='user_id',
)
Then, apply the group by transform primitive in DFS.
fm, fl = ft.dfs(
    target_entity="data",
    entityset=es,
    groupby_trans_primitives=["diff"],
)
fm.filter(regex="DIFF", axis=1)
You should get the difference by user.
       DIFF(x) by user_id
index
0                     NaN
1                     NaN
2                     2.0
3                     2.0
4                     2.0
5                     2.0
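If you want to sanity-check that column outside of featuretools, a plain pandas group-by difference should produce the same values. This is only a cross-check sketch, not what DFS runs internally; the variable name `expected` is my own, and the data is rebuilt from the question so the snippet stands alone.

```python
import pandas as pd

# Same toy data as in the question.
df = pd.DataFrame({
    "user_id": [i % 2 for i in range(6)],
    "x": range(6),
    "time": pd.to_datetime([
        "2014-1-1 04:00", "2014-1-1 05:00", "2014-1-1 06:00",
        "2014-1-1 08:00", "2014-1-1 10:00", "2014-1-1 12:00",
    ]),
})

# Sort by time so "previous" means the previous observation in time,
# then take the row-to-row difference of x within each user_id group.
expected = df.sort_values("time").groupby("user_id")["x"].diff()
print(expected.to_string())
```

The first row of each user has no earlier value to subtract, so it comes out as NaN; every later row is 2.0 for this data.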
Upvotes: 2