Naina

Reputation: 11

KMeans / ValueError: Found input variables with inconsistent numbers of samples:

My initial data is:

data_init = pd.read_csv('data_merged.csv')

The total period to cover is 25 months.

initial_period_data = data_init[(data_init['order_purchase_timestamp'] >= earliest_timestamp) & (data_init['order_purchase_timestamp'] < initial_period_end)]

process_data(initial_period_data)  # function that recalculates all the features for this period

I then select the numerical variables and apply a scaler to normalise them:

initial_period_data_num = initial_period_data.select_dtypes(include="number").fillna(0)
scaler_init = StandardScaler()
initial_period_data_num_scaled = scaler_init.fit_transform(initial_period_data_num)

Then I train KMeans:

model_init = KMeans(n_clusters=6)
model_init = model_init.fit(initial_period_data_num_scaled)
model_init_labels = model_init.labels_
initial_period_data['labels'] = model_init_labels


from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler
import numpy as np

initial_period_end = initial_period_data['order_purchase_timestamp'].max()
period_end = data_init['order_purchase_timestamp'].max()

**I am stuck from here.** I want to iterate over the whole period, starting from the initial 12-month period and advancing in 2-week intervals until the end of the period to be covered, and then compare the initial clusters with the new clusters.
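The two-week iteration described above could be laid out as follows. This is only a sketch under stated assumptions: the dates are invented, the toy `orders` frame stands in for `data_init`, and it assumes each window is cumulative (from the earliest date up to the current cut-off), which may not be the intended windowing.

```python
import pandas as pd

# Hypothetical dates, chosen only for illustration.
earliest_timestamp = pd.Timestamp("2017-01-01")
initial_period_end = earliest_timestamp + pd.DateOffset(months=12)
period_end = earliest_timestamp + pd.DateOffset(months=25)

# Cut-off dates every 14 days, from the end of the initial period onwards.
cutoffs = pd.date_range(start=initial_period_end, end=period_end, freq="14D")

# Toy stand-in for data_init: one order per day (assumption).
orders = pd.DataFrame(
    {"order_purchase_timestamp": pd.date_range(earliest_timestamp, period_end, freq="D")}
)

window_sizes = []
for cutoff in cutoffs[1:]:  # cutoffs[0] is the initial period itself
    # Cumulative window: everything from the earliest date up to the cut-off.
    in_window = (orders["order_purchase_timestamp"] >= earliest_timestamp) & (
        orders["order_purchase_timestamp"] < cutoff
    )
    window = orders[in_window]
    window_sizes.append(len(window))
```

Each `window` would then go through the same feature recalculation as the initial period before clustering.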

# List to store the ARI scores
ari_scores = []

total_periods = 25

for p in range(2, total_periods + 1, 2):
    current_period_end = initial_period_end + pd.DateOffset(weeks=p)
    data_period = data_init[(data_init['order_purchase_timestamp'] >= initial_period_end) & (data_init['order_purchase_timestamp'] < current_period_end)]
    
    data_period_group = process_data(data_period)

    # Select the numeric columns
    data_period_num = data_period_group.select_dtypes(include="number").fillna(0)

    # Scale with the initial scaler
    data_period_scaled = scaler_init.fit_transform(data_period_num)

    # Predict with the initial KMeans model
    model_init.predict(data_period_scaled)
    p_labels = model_init.labels_



    # Compute the ARI score for the current period
    ari_p = adjusted_rand_score(model_init_labels, p_labels)
    ari_scores.append([p, ari_p])

But I get this error: ValueError: Found input variables with inconsistent numbers of samples: [28876, 2050]. Can you please tell me where I went wrong?

Hello everybody, I need your help. In order to establish a maintenance contract for the customer segmentation algorithm, I must test its stability over time and see, for example, when customers change clusters. For that, I have to recalculate all the features for a given period.
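One possible reading of the ValueError, sketched on synthetic data: `adjusted_rand_score` compares two labelings of the same samples, so both arrays must have the same length. In scikit-learn, `model.labels_` always holds the labels of the training rows, while `predict` returns the labels of the new rows. The sketch assumes the comparison wanted for each period is between the initial model's predictions and a freshly fitted model's labels on the same rows. The `make_blobs` data, the cluster count, and names such as `model_period` are illustrative and not taken from the code above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the initial period and one later period (assumption).
X_init, _ = make_blobs(n_samples=500, centers=4, n_features=3, random_state=0)
X_period, _ = make_blobs(n_samples=120, centers=4, n_features=3, random_state=1)

# Fit the scaler and the model once, on the initial period only.
scaler_init = StandardScaler().fit(X_init)
model_init = KMeans(n_clusters=4, n_init=10, random_state=0)
model_init.fit(scaler_init.transform(X_init))

# Reuse the fitted scaler (transform, not fit_transform) and keep what predict returns.
labels_from_init = model_init.predict(scaler_init.transform(X_period))

# model_init.labels_ still describes the 500 training rows, not the 120 new ones.
n_training_labels = len(model_init.labels_)

# Refit a separate scaler and model on the period's own rows.
scaler_period = StandardScaler().fit(X_period)
model_period = KMeans(n_clusters=4, n_init=10, random_state=0)
labels_from_refit = model_period.fit_predict(scaler_period.transform(X_period))

# Both arrays label the same 120 rows, so the lengths match.
ari = adjusted_rand_score(labels_from_init, labels_from_refit)
```

Passing `model_init.labels_` together with `labels_from_init` to `adjusted_rand_score` in this sketch would raise the same kind of error, because the first array has 500 entries and the second 120.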

Upvotes: 1

Views: 29

Answers (0)
