numentar

Reputation: 1079

Restricting feature generation to a particular entity in FeatureTools

I'm trying to understand how to specify primitive_options in FeatureTools (version 0.16) to include only a certain entity. Based on the docs I should be using include_entities:

List of entities to be included when creating features for the primitive(s). All other entities will be ignored (list[str]).

Simple case

Here's some example code:

import pprint
import featuretools as ft
from featuretools.primitives import GreaterThanScalar

esd1 = ft.demo.load_mock_customer(return_entityset=True)

def run_dfs(esd, primitive_options={}):
    feature_defs = ft.dfs(
        entityset=esd,
        target_entity="customers",
        agg_primitives=["count"],
        where_primitives=["count",GreaterThanScalar(value=0)],
        trans_primitives=[GreaterThanScalar(value=0)],
        primitive_options=primitive_options,
        max_depth=4,
        features_only=True
    )
    pprint.pprint(feature_defs)

run_dfs(esd1)

This produces:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions) > 0>,
 <Feature: COUNT(transactions) > 0>]

Suppose I'm interested in the sessions and transactions counts and whether the sessions count was larger than 0. Based on the docs I'd go for include_entities here:

run_dfs(esd1, primitive_options={
          "greater_than_scalar":{
              "include_entities":['sessions']}
        })

The output from this, however, is:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>]

Both GreaterThanScalar features are gone now. If I use ignore_entities instead I get:

run_dfs(esd1, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions"],
            }
        })

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions) > 0>]

So it works, but I'm not sure why ignore_entities gives the result I need and include_entities does not. Am I missing something?

More complex case

Although I sort of got the simple case to work, what I really want is something a bit more complicated. I'd like to get a boolean feature that tells me whether there were more than zero sessions on a particular device.

To do this:

esd2 = ft.demo.load_mock_customer(return_entityset=True)
esd2['sessions'].add_interesting_values()
run_dfs(esd2)

yielding:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions) > 0>,
 <Feature: COUNT(sessions) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>,
 <Feature: COUNT(sessions WHERE device = desktop) > 0>,
 <Feature: COUNT(sessions WHERE device = tablet) > 0>,
 <Feature: COUNT(sessions WHERE device = mobile) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop) > 0>]

The features I need are the 4th to 6th counting from the bottom. If I try to restrict dfs to the sessions entity and the device variable:

run_dfs(esd2, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions"],
                "include_variables":{"sessions":["device"]}
            }
        })

the result is:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>]

No GreaterThanScalar features.

Is there a way to make dfs give me just the three GreaterThanScalar features I want here?

Update: Third case

Is there a way to limit what gets counted under where? For example:

esd3 = ft.demo.load_mock_customer(return_entityset=True)
esd3['sessions'].add_interesting_values()
esd3['products'].add_interesting_values()

run_dfs(esd3, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions","sessions"],
            },
            "count":{
                "ignore_variables":{"transactions":['session_id']}
            }
        })

gives:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE products.brand = B)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>,
 <Feature: COUNT(transactions WHERE products.brand = A)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>]

Is it possible to limit the COUNT(transactions WHERE ...) features to only products? I'd still want to keep the COUNT(sessions ...) features.

Upvotes: 0

Views: 164

Answers (1)

Frances Hartwell

Reputation: 191

Adding 'session_id' from the 'sessions' entity to the include_variables option will generate the features you're looking for:

primitive_options={
    "greater_than_scalar":{
         "ignore_entities":["transactions"],
         "include_variables":{"sessions":["session_id", "device"]}}}

The Count primitive uses the entity index as its base, along with any where columns. If you include only the where column in the GreaterThanScalar primitive options, dfs ends up ignoring all the Count features for GreaterThanScalar, because they all use an implicitly ignored column (the entity index). In this case the desired Count features are based on the 'sessions' entity, so adding its index ('session_id') to the include_variables option allows the desired features to be generated.
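As a rough sketch of that rule, the options can be built by a small helper that always puts the entity index next to the where columns. The helper name gts_options and its parameters are hypothetical and not part of FeatureTools; it only assembles the dictionary shown above.

```python
def gts_options(entity, index, where_columns, ignore_entities=()):
    """Hypothetical helper (not a FeatureTools API).

    Builds primitive_options for GreaterThanScalar so that the entity
    index is always listed alongside the where columns.
    """
    return {
        "greater_than_scalar": {
            "ignore_entities": list(ignore_entities),
            # The index comes first, then the where columns.
            "include_variables": {entity: [index, *where_columns]},
        }
    }


options = gts_options(
    "sessions", "session_id", ["device"], ignore_entities=["transactions"]
)
# Pass the result to the question's helper:
# run_dfs(esd2, primitive_options=options)
```

The dfs call is left as a comment so the snippet runs without an entity set; with esd2 and run_dfs from the question it should behave the same as the hand-written dictionary.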

Also, in the first example using include_entities, the GreaterThanScalar features are lost because the 'customers' entity (the target entity) isn't included. The Count features are all aggregation features in the 'customers' entity; they represent the count of something per customer. In order to use the Count features, the GreaterThanScalar primitive needs to be allowed to use both the 'customers' entity, where the Count features are located, and the entity that the desired Count feature is based on ('sessions' in this case).
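Following that reasoning, the include_entities version of the first example would presumably list both entities. This is a sketch derived from the explanation above; I have not run it against version 0.16, so treat the resulting feature list as something to check rather than a given.

```python
# Assumption: allowing both the target entity ('customers') and the
# entity the Count is based on ('sessions') is enough for
# GreaterThanScalar to be applied to COUNT(sessions).
primitive_options = {
    "greater_than_scalar": {
        "include_entities": ["customers", "sessions"],
    }
}

# Using the helper and entity set defined in the question:
# run_dfs(esd1, primitive_options=primitive_options)
```

If the reasoning holds, 'transactions' is left out of the list, so GreaterThanScalar should not be applied to COUNT(transactions).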

Upvotes: 3
