How to add cluster label columns back into the original dataframe (Python), for supervised learning

Question

I have a column in my dataframe that contains URL information. It has 1200+ unique values. I want to use text mining to generate features from these values. I used TfidfVectorizer to generate vectors and then KMeans to identify clusters. I now want to assign these cluster labels back to my original dataframe, so that I can bin the URL information into these clusters.

Below is the code that generates the vectors and cluster labels:

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the URL column with TF-IDF
vectorizer = TfidfVectorizer(min_df=1, lowercase=False, ngram_range=(1, 1), use_idf=True, stop_words='english')
X = vectorizer.fit_transform(sample['lead_lead_source_modified'])
X = X.toarray()

# Elbow method: mean distance from each point to its nearest centroid, for k = 1..9
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(X)
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

# append cluster labels
km = KMeans(n_clusters=4, random_state=0)
km.fit(X)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel_lead_lead_source'])
cluster_labels

Using the elbow method, I decided on 4 clusters. I now have the cluster labels, but I am not sure how to add them back to the dataframe at their respective indexes. Concatenating along axis=1 creates NaNs because the indexes do not match. Below is a sample of the output after concatenation.

   lead_lead_source_modified               ClusterLabel_lead_lead_source
0  NaN                                     3.0
1  NaN                                     0.0
2  NaN                                     0.0
3  ['direct', 'salesline', 'website', '']  0.0

Is this approach the right way to do it? If so, how can I solve this issue? If not, is there a better way?
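
For reference, here is a minimal sketch of why the NaNs appear and two possible ways around them. It assumes the labels in km.labels_ come out in the same row order as sample, which should hold when X is built from sample without reordering. The small frame, its index values and the labels array are made up for illustration and only stand in for sample and km.labels_.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `sample`: note the non-default index,
# as you would get after filtering or dropping rows.
sample = pd.DataFrame(
    {"lead_lead_source_modified": ["direct website", "salesline", "direct", "website salesline"]},
    index=[10, 11, 15, 20],
)

# Hypothetical stand-in for km.labels_: one label per row, in the row order of `sample`.
labels = np.array([3, 0, 0, 1])

# A fresh DataFrame gets the default index 0..n-1, so concat aligns on
# index values that do not exist in `sample` and fills the gaps with NaN.
misaligned = pd.concat(
    [sample, pd.DataFrame(labels, columns=["ClusterLabel_lead_lead_source"])],
    axis=1,
)

# Option 1: assign the array directly; pandas places it by position.
fixed = sample.copy()
fixed["ClusterLabel_lead_lead_source"] = labels

# Option 2: give the labels the original index first, then concat.
label_col = pd.Series(labels, index=sample.index, name="ClusterLabel_lead_lead_source")
fixed_concat = pd.concat([sample, label_col], axis=1)

print(misaligned)
print(fixed)
```

Both options keep the original index intact. Calling reset_index(drop=True) on sample before concatenating would probably work too, at the cost of losing the original index values.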
