Reputation: 11
My initial data is:
data_init = pd.read_csv('data_merged.csv')
The total period to cover is 25 months.
initial_period_data = data_init[(data_init['order_purchase_timestamp'] >= earliest_timestamp) & (data_init['order_purchase_timestamp'] < initial_period_end)]
process_data(initial_period_data)  # function that recalculates all the features as of this date
I then select the numerical variables and apply a scaler to normalise them:
initial_period_data_num = initial_period_data.select_dtypes(include="number").fillna(0)
scaler_init = StandardScaler()
initial_period_data_num_scaled = scaler_init.fit_transform(initial_period_data_num)
Then I train KMeans:
model_init = KMeans(n_clusters=6)
model_init = model_init.fit(initial_period_data_num_scaled)
model_init_labels = model_init.labels_
initial_period_data['labels'] = model_init_labels
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler
initial_period_end = initial_period_data['order_purchase_timestamp'].max()
period_end = data_init['order_purchase_timestamp'].max()
**I am stuck from here.** I want to iterate over the whole period, starting from the initial 12-month period, in 2-week intervals until the end of the period to be covered, and then compare the initial clusters with the new clusters.
# List to store the ARI scores
ari_scores = []
total_periods = 25
for p in range(2, total_periods + 1, 2):
    current_period_end = initial_period_end + pd.DateOffset(weeks=p)
    data_period = data_init[(data_init['order_purchase_timestamp'] >= initial_period_end) & (data_init['order_purchase_timestamp'] < current_period_end)]
    data_period_group = process_data(data_period)
    # Select the numerical columns
    data_period_num = data_period_group.select_dtypes(include="number").fillna(0)
    # Initial scaler
    data_period_scaled = scaler_init.fit_transform(data_period_num)
    # Initial KMeans
    model_init.predict(data_period_scaled)
    p_labels = model_init.labels_
    # Compute the ARI score for the current period
    ari_p = adjusted_rand_score(model_init_labels, p_labels)
    ari_scores.append([p, ari_p])
But I get this error: ValueError: Found input variables with inconsistent numbers of samples: [28876, 2050]. Can you please tell me what I am doing wrong?
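For reference, here is a minimal sketch of what seems to trigger this kind of error. It uses synthetic arrays instead of the real data, and the sizes (300 and 40 rows) and variable names are made up for illustration. The point it tries to show: `labels_` always holds the labels of the rows the model was fitted on, `predict` returns labels for the rows passed in, and `adjusted_rand_score` expects two label arrays for the same samples.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_init = rng.normal(size=(300, 4))  # hypothetical stand-in for the initial-period features
X_new = rng.normal(size=(40, 4))    # hypothetical stand-in for a later 2-week window

# Fit the scaler and the model once, on the initial period only
scaler = StandardScaler().fit(X_init)
model = KMeans(n_clusters=6, n_init=10, random_state=0).fit(scaler.transform(X_init))

# predict() returns one label per row of X_new; labels_ still describes X_init
pred_new = model.predict(scaler.transform(X_new))
print(model.labels_.shape, pred_new.shape)

# Comparing label arrays of different lengths raises the ValueError
try:
    adjusted_rand_score(model.labels_, pred_new)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

Under these assumptions, the two numbers in the error message would be the row counts of the two label arrays being compared.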
Hello everybody, I need your help. In order to establish a maintenance contract for a customer segmentation algorithm, I must test its stability over time and see, for example, when customers change clusters. For that, I have to recalculate all the features for a given period.
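One possible way to frame such a stability test is sketched below. It is not necessarily the intended method. For each later cutoff, the rows are labelled twice: once with the frozen initial scaler and model, and once with a scaler and model refit on those same rows. The two label arrays then have the same length and can be passed to `adjusted_rand_score`. The `make_features` helper, the `centers` array, the row counts, and the drift values are all hypothetical stand-ins for `process_data` and the real features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
centers = rng.normal(scale=5.0, size=(6, 3))  # hypothetical segment centres

def make_features(n_rows, drift):
    """Hypothetical stand-in for process_data(): synthetic customer features."""
    groups = rng.integers(0, 6, size=n_rows)
    return centers[groups] + rng.normal(scale=1.0 + drift, size=(n_rows, 3))

# Initial period: fit the scaler and the model once and keep them frozen
X0 = make_features(500, drift=0.0)
scaler0 = StandardScaler().fit(X0)
model0 = KMeans(n_clusters=6, n_init=10, random_state=0).fit(scaler0.transform(X0))

ari_scores = []
for weeks in range(2, 13, 2):
    # Features recomputed at the new cutoff (more rows, noisier)
    X_t = make_features(500 + 20 * weeks, drift=0.1 * weeks)
    # Labels from the frozen initial model: transform only, no refit
    labels_old = model0.predict(scaler0.transform(X_t))
    # Labels from a scaler and model refit on the same rows
    scaler_t = StandardScaler().fit(X_t)
    model_t = KMeans(n_clusters=6, n_init=10, random_state=0)
    labels_new = model_t.fit_predict(scaler_t.transform(X_t))
    # Both arrays describe the same rows, so the lengths match
    ari_scores.append((weeks, adjusted_rand_score(labels_old, labels_new)))

for weeks, ari in ari_scores:
    print(f"week {weeks:2d}: ARI = {ari:.3f}")
```

A falling ARI over successive cutoffs would suggest that the initial segmentation is drifting away from what a freshly trained model would produce, which is one signal that could feed into a retraining schedule.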
Upvotes: 1
Views: 29