Reputation: 65
I have these 3 dataframes:
df_train truncated:____________________
SK_ID_CURR TARGET NAME_CONTRACT_TYPE_Cash loans \
0 100002 1 1
1 100003 0 1
2 100004 0 0
3 100006 0 1
4 100007 0 1
NAME_CONTRACT_TYPE_Revolving loans CODE_GENDER_F CODE_GENDER_M
0 0 0 1
1 0 1 0
2 1 0 1
3 0 1 0
4 0 0 1
df_bureau truncated:____________________
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE_Active
0 100002 5714464 1
1 100002 5714465 1
2 215354 5714466 1
3 215354 5714467 1
4 215354 5714468 1
bureau_balance truncated 3:____________________
SK_ID_BUREAU MONTHS_BALANCE STATUS_C
0 5715448 0 1
1 5715448 -1 1
2 5715448 -2 1
3 5715448 -3 1
4 5715448 -4 1
And this is the script I am trying to run for feature synthesis:
entities = {
"train" : (df_train, "SK_ID_CURR"),
"bureau" : (df_bureau, "SK_ID_BUREAU"),
"bureau_balance" : (df_bureau_balance,"MONTHS_BALANCE", "STATUS", "SK_ID_BUREAU") ,
}
relationships = [
("bureau", "SK_ID_BUREAU", "bureau_balance", "SK_ID_BUREAU"),
("train", "SK_ID_CURR", "bureau", "SK_ID_CURR")
]
feature_matrix_customers, features_defs = ft.dfs(entities=entities,
relationships=relationships,
target_entity="train"
)
But whenever I introduce the column "STATUS", this error happens: TypeError: 'str' object does not support item assignment
If I leave out the column "STATUS", it works with a few rows of the dataframe. When the number of rows increases (and only including STATUS as a key would solve it), this other error happens: AssertionError: Index is not unique on dataframe (Entity bureau_balance)
Thanks in advance!!
Upvotes: 4
Views: 822
Reputation: 2014
caseWestern's answer is the recommended way to create an EntitySet in Featuretools.
That being said, the error you are seeing is because Featuretools expects the 4 values for an entity to be (dataframe, index, time index, variable types), where variable types is a dictionary dict[str -> Variable]. Right now, you are passing in a string as the 4th value, so Featuretools fails when it tries to add entries to it, because it isn't actually a dictionary.
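To make the failure mode concrete, here is a minimal pure-Python sketch. The function add_entity is a hypothetical stand-in, not Featuretools' actual code; it only assumes that the library assigns into the 4th tuple value as if it were a dictionary, which is enough to reproduce the same TypeError.

```python
def add_entity(entry):
    # Hypothetical simplification: unpack the 4 values of an entity tuple.
    dataframe, index, time_index, variable_types = entry
    # Item assignment works on a dict but raises TypeError on a str.
    variable_types[index] = "Index"
    return variable_types


# 4th value is a dict: the assignment succeeds.
ok_result = add_entity(("df", "SK_ID_BUREAU", None, {}))

# 4th value is a str, as in the question: the assignment fails.
try:
    add_entity(("df", "MONTHS_BALANCE", "STATUS", "SK_ID_BUREAU"))
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

With the tuple from the question, "STATUS" lands in the time index position and "SK_ID_BUREAU" in the variable types position, which is why the error only appears once the extra column names are added.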
You can see the documentation for Entity Set for more information.
Upvotes: 0
Reputation: 3827
You are right that the dataframes need a unique index to be made an entity. One simple option is to add a unique index to df_bureau_balance using
df_bureau_balance.reset_index(inplace = True)
and then making the entities:
entities = {
"train" : (df_train, "SK_ID_CURR"),
"bureau" : (df_bureau, "SK_ID_BUREAU"),
"bureau_balance" : (df_bureau_balance, "index")
}
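As a quick sanity check of this approach, the sketch below uses a toy stand-in for df_bureau_balance built from the rows printed in the question (an assumption; the real dataframe is much larger). It shows that SK_ID_BUREAU repeats, and that reset_index adds a column named "index" with unique values when the original index is the default unnamed one.

```python
import pandas as pd

# Toy stand-in for df_bureau_balance, using the values shown in the question.
df_bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [5715448] * 5,
    "MONTHS_BALANCE": [0, -1, -2, -3, -4],
    "STATUS_C": [1, 1, 1, 1, 1],
})

# SK_ID_BUREAU repeats once per month, so it cannot serve as the entity index.
sk_unique = df_bureau_balance["SK_ID_BUREAU"].is_unique

# reset_index moves the default RangeIndex into a new "index" column.
df_bureau_balance.reset_index(inplace=True)
index_unique = df_bureau_balance["index"].is_unique

print(sk_unique, index_unique)
print(list(df_bureau_balance.columns))
```

If the dataframe already has a named index or an existing "index" column, the new column name will differ, so it is worth checking df_bureau_balance.columns before building the entities dictionary.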
A much better option is to use an EntitySet to represent your data. When we create an entity from df_bureau_balance, because it does not have a unique index, we pass in make_index = True and a name for the index (this can be any name, provided it is not already a column in the data). The rest is very similar to your work, just with slightly different syntax! Here is a complete working example:
import featuretools as ft

# Create the entityset
es = ft.EntitySet('customers')
# Add the entities to the entityset
es = es.entity_from_dataframe('train', df_train, index = 'SK_ID_CURR')
es = es.entity_from_dataframe('bureau', df_bureau, index = 'SK_ID_BUREAU')
# bureau_balance has no unique column, so let Featuretools create an index
es = es.entity_from_dataframe('bureau_balance', df_bureau_balance,
                              make_index = True, index = 'bureau_balance_index')
# Define the relationships (parent variable first, then child variable)
r_train_bureau = ft.Relationship(es['train']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'],
                                   es['bureau_balance']['SK_ID_BUREAU'])
# Add the relationships
es = es.add_relationships([r_train_bureau, r_bureau_balance])
# Deep feature synthesis
feature_matrix_customers, feature_defs = ft.dfs(entityset=es, target_entity = 'train')
Entitysets help you keep track of all your data in a single structure! The Featuretools documentation is good for getting down the basics of using entitysets and I would recommend giving it a read.
Upvotes: 4