Reputation: 2416
I am trying to figure out how Featuretools works, and I am testing it on the Housing Prices dataset on Kaggle. Because the dataset is large, I'll work here with only a subset of it.
The dataframe is:
import pandas as pd
import featuretools as ft

train = pd.DataFrame({
    'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60},
    'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'},
    'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0},
    'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260}
})
I set the dataframe properties:
dataframes = {'train': (train, 'Id')}
Then I call the dfs method:
train_feature_matrix, train_feature_names = ft.dfs(
    dataframes=dataframes,
    target_dataframe_name='train',
    max_depth=10,
    agg_primitives=["mean", "sum", "mode"],
)
I get the following warning:
UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  agg_primitives: ['mean', 'mode', 'sum']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data. If the DFS call contained multiple instances of a primitive in the list above, none of them were used.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)
And the train_feature_matrix
is exactly the same as the original train
dataframe.
At first, I thought this was because my dataframe is small and nothing useful can be extracted from it, but I get the same behavior with the entire dataset (80 columns and 1460 rows).
Every example I saw on the Featuretools page had 2+ dataframes, but I only have one.
Can you shed some light here? What am I doing wrong?
Upvotes: 0
Views: 710
Reputation: 1
If you only have a single dataset, the "headjackai" library may fit your situation better than Featuretools. In this library, feature engineering functions are learned from datasets; technically speaking, the library provides an embedding space for exchanging features between tabular-data domains, so you can, for example, apply features learned from the Titanic domain to improve a house-pricing task.
It is an open community, so you can create new feature engineering functions yourself or apply ones that other people have published in the public feature model pool. It currently has more than a hundred feature models.
For example:
import numpy as np
import lightgbm as lgbm
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

from headjackai.headjackai_hub import headjackai_hub

# headjackai experiment
# host setting
hj_hub = headjackai_hub('http://www.headjackai.com:9000')

# account login
hj_hub.login(username='jimliu_stackoverflow', pwd='jimliu_stackoverflow')

pool_list = hj_hub.knowledgepool_check(True)

score_list = []
task_list = []

# try each feature model (X and y are the features and target of the task)
for source in pool_list:
    hj_X = hj_hub.knowledge_transform(data=X,
                                      target_domain='boston_comparsion',
                                      source_domain=source,
                                      label='')

    N_SPLITS = 5
    strat_kf = KFold(n_splits=N_SPLITS, shuffle=True, random_state=8888)
    tr_scores = np.empty(N_SPLITS)
    scores = np.empty(N_SPLITS)

    try:
        # 5-fold CV with LightGBM, scored by MAE
        for idx, (train_idx, test_idx) in enumerate(strat_kf.split(X, y)):
            X_train, X_test = hj_X.iloc[train_idx], hj_X.iloc[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]

            cb_clf = lgbm.LGBMRegressor()
            cb_clf.fit(X_train, y_train)

            preds = cb_clf.predict(X_test)
            scores[idx] = mean_absolute_error(y_test, preds)

            preds = cb_clf.predict(X_train)
            tr_scores[idx] = mean_absolute_error(y_train, preds)

        print("-----------------", source, "-----------------")
        print(f"mean train score: {tr_scores.mean():.5f}")
        print(f"mean test score: {scores.mean():.5f}")

        score_list.append(scores.mean())
        task_list.append(source)
    except Exception:
        pass

arg_index = score_list.index(min(score_list))
print(task_list[arg_index], min(score_list))
# ames-house 2.1316169625933044
In the code sample above, I try each feature model on the Boston house-pricing task and pick the best one as the feature engineering function.
With this library, you can get plenty of automated feature generation, even from a single dataset.
Upvotes: 0
Reputation: 101
Aggregation primitives cannot create features on an EntitySet with a single DataFrame.
This is because the aggregation they perform occurs over the one-to-many relationship that exists when you have a parent-child relationship between DataFrames in an EntitySet. The Featuretools guide on primitives has a section that explains the difference here. With your data, that might look like a child DataFrame that has a non-unique house_id
column. Then, running dfs on your train
DataFrame would aggregate the desired information for each Id
, using every row where it appears in the child DataFrame.
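Conceptually, that parent-child aggregation is the same roll-up a pandas groupby performs from the child table back to the parent. A minimal sketch, using a hypothetical visits child table keyed by a non-unique house_id (not part of the Kaggle data):

```python
import pandas as pd

# Parent table: one row per house, like the `train` DataFrame
train = pd.DataFrame({"Id": [1, 2, 3], "LotArea": [8450, 9600, 11250]})

# Hypothetical child table: many rows per house (one-to-many)
visits = pd.DataFrame({
    "house_id": [1, 1, 2, 3, 3, 3],
    "offer": [100, 110, 95, 120, 125, 118],
})

# Aggregation primitives like MEAN and SUM roll the child rows
# up to each parent Id over this one-to-many relationship
agg = visits.groupby("house_id")["offer"].agg(["mean", "sum"])
train = train.merge(agg, left_on="Id", right_index=True)
print(train)
```

With only one DataFrame there is no child table to group, so the aggregation primitives have nothing to aggregate, which is exactly what the warning reports.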
To get automated feature generation with a single DataFrame, you should use Transform features. The available Transform Primitives can be found here.
Upvotes: 1