Reputation: 1079
I'm trying to understand how to specify primitive_options in FeatureTools (version 0.16) so that a primitive is applied only to a certain entity. Based on the docs, I should be using include_entities:
List of entities to be included when creating features for the primitive(s). All other entities will be ignored (list[str]).
Here's some example code:
import pprint

import featuretools as ft
from featuretools.primitives import GreaterThanScalar

esd1 = ft.demo.load_mock_customer(return_entityset=True)

def run_dfs(esd, primitive_options=None):
    # Only build the feature definitions (features_only=True); no feature matrix is computed.
    feature_defs = ft.dfs(
        entityset=esd,
        target_entity="customers",
        agg_primitives=["count"],
        where_primitives=["count", GreaterThanScalar(value=0)],
        trans_primitives=[GreaterThanScalar(value=0)],
        # Fall back to an empty dict so the default is not a shared mutable object.
        primitive_options=primitive_options or {},
        max_depth=4,
        features_only=True,
    )
    pprint.pprint(feature_defs)

run_dfs(esd1)
This produces:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions) > 0>,
<Feature: COUNT(transactions) > 0>]
Suppose I'm interested in the sessions and transactions counts, and in whether the sessions count was greater than 0. Based on the docs, I'd go for include_entities here:
run_dfs(esd1, primitive_options={
"greater_than_scalar":{
"include_entities":['sessions']}
})
The output from this, however, is:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>]
Both GreaterThanScalar features are gone now. If I use ignore_entities instead, I get:
run_dfs(esd1, primitive_options={
"greater_than_scalar":{
"ignore_entities":["transactions"],
}
})
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions) > 0>]
So it works, but I'm not sure why ignore_entities gives the result I need and include_entities does not. Am I missing something?
Although I sort of got the simple case to work, what I really want is a bit more complicated. I'd like to get a boolean feature that tells me whether there were more than zero sessions on a particular device.
To do this, I ran:
esd2 = ft.demo.load_mock_customer(return_entityset=True)
esd2['sessions'].add_interesting_values()
run_dfs(esd2)
yielding:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions) > 0>,
<Feature: COUNT(sessions) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>,
<Feature: COUNT(sessions WHERE device = desktop) > 0>,
<Feature: COUNT(sessions WHERE device = tablet) > 0>,
<Feature: COUNT(sessions WHERE device = mobile) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = tablet) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = mobile) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = desktop) > 0>]
The features I need are the fourth to sixth counting from the bottom. If I try to restrict dfs to the sessions entity and the device variable:
run_dfs(esd2, primitive_options={
"greater_than_scalar":{
"ignore_entities":["transactions"],
"include_variables":{"sessions":["device"]}
}
})
the result is:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>]
No GreaterThanScalar features.
Is there a way to make dfs give me just the three GreaterThanScalar features I want here?
Is there a way to limit what gets counted under where? For example:
esd3 = ft.demo.load_mock_customer(return_entityset=True)
esd3['sessions'].add_interesting_values()
esd3['products'].add_interesting_values()
run_dfs(esd3, primitive_options={
"greater_than_scalar":{
"ignore_entities":["transactions","sessions"],
},
"count":{
"ignore_variables":{"transactions":['session_id']}
}
})
gives:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE products.brand = B)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>,
<Feature: COUNT(transactions WHERE products.brand = A)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>]
Is it possible to limit the COUNT(transactions WHERE ...) features to only products? I'd still want to keep the COUNT(sessions ...) features.
Upvotes: 0
Views: 164
Reputation: 191
Adding 'session_id' from the 'sessions' entity to the include_variables option will generate the features you're looking for:
primitive_options={
"greater_than_scalar":{
"ignore_entities":["transactions"],
"include_variables":{"sessions":["session_id", "device"]}}}
The Count primitive uses the entity index as its base, as well as any where columns. If you include only the where column in the GreaterThanScalar primitive options, dfs ends up ignoring all the Count features for GreaterThanScalar because they all use an implicitly ignored column (the entity index). In this case, the desired Count features use the 'sessions' entity, so adding the 'sessions' entity index ('session_id') to the include_variables option allows the desired features to be generated.
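To make the reasoning above concrete, here is a minimal toy model of the filtering rule. It is only a sketch and not featuretools' actual implementation: the `is_allowed` function and the (entity, variable) dependency sets are hypothetical names introduced for illustration, and the model assumes that a feature passes only if every column it depends on is allowed.

```python
# Toy model only: this is NOT how featuretools is implemented internally.
from typing import Dict, List, Set, Tuple

Column = Tuple[str, str]  # (entity, variable)


def is_allowed(deps: Set[Column], include_variables: Dict[str, List[str]]) -> bool:
    """Return True if every column the feature depends on is allowed.

    For an entity listed in include_variables, only the listed variables are
    allowed; entities that are not listed are unrestricted in this model.
    """
    for entity, variable in deps:
        if entity in include_variables and variable not in include_variables[entity]:
            return False
    return True


# Assumed dependencies of COUNT(sessions WHERE device = desktop):
# the sessions index (the base of Count) plus the where column.
count_sessions_where_device = {("sessions", "session_id"), ("sessions", "device")}

# Only the where column is included, so the index is implicitly ignored.
print(is_allowed(count_sessions_where_device, {"sessions": ["device"]}))

# Index and where column are both included.
print(is_allowed(count_sessions_where_device, {"sessions": ["session_id", "device"]}))
```

In this model the first call rejects the feature and the second accepts it, which mirrors why adding 'session_id' brings the GreaterThanScalar features back.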
Also, in the first example using include_entities, the GreaterThanScalar features are lost because the 'customers' entity (the target entity) isn't included. The Count features are all aggregation features in the 'customers' entity; they represent the count of something per customer. In order to use the Count features, the GreaterThanScalar primitive needs to be allowed to use both the 'customers' entity, where the Count features are located, and the entity that the desired Count feature is based on ('sessions' in this case).
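The same idea can be sketched for include_entities. Again, this is a simplified toy model rather than featuretools' own logic: `passes_include_entities` and the entity sets are hypothetical, and the model assumes each Count feature touches just two entities, the one it lives in ('customers') and the one it counts.

```python
# Toy model only: this is NOT how featuretools is implemented internally.
from typing import List, Set


def passes_include_entities(feature_entities: Set[str], include_entities: List[str]) -> bool:
    """Return True if every entity the feature touches is in include_entities."""
    return feature_entities <= set(include_entities)


# Assumed entities touched by each aggregation feature on the target entity.
count_sessions = {"customers", "sessions"}
count_transactions = {"customers", "transactions"}

# Only 'sessions' is included, so the target entity 'customers' is missing.
print(passes_include_entities(count_sessions, ["sessions"]))

# Both the target entity and the counted entity are included.
print(passes_include_entities(count_sessions, ["customers", "sessions"]))

# 'transactions' is still excluded, so this feature is filtered out.
print(passes_include_entities(count_transactions, ["customers", "sessions"]))
```

Following this reasoning, passing `{"greater_than_scalar": {"include_entities": ["customers", "sessions"]}}` as primitive_options should keep COUNT(sessions) > 0 and drop COUNT(transactions) > 0, though that has not been run against version 0.16 here.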
Upvotes: 3