Reputation: 61
I have a column in my dataframe that contains URL information, with 1200+ unique values. I want to use text mining to generate features from these values. I used TfidfVectorizer to generate vectors and then KMeans to identify clusters. I now want to assign the cluster labels back to my original dataframe so that I can bin the URL information into these clusters.
Below is the code that generates the vectors and cluster labels:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Turn each URL string into a TF-IDF vector
vectorizer = TfidfVectorizer(min_df=1, lowercase=False, ngram_range=(1, 1), use_idf=True, stop_words='english')
X = vectorizer.fit_transform(sample['lead_lead_source_modified'])
X = X.toarray()

# Elbow method: mean distance to the nearest centroid for k = 1..9
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(X)
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

# append cluster labels
km = KMeans(n_clusters=4, random_state=0)
km.fit(X)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel_lead_lead_source'])
cluster_labels
Using the elbow method, I decided on 4 clusters. I now have cluster labels, but I am not sure how to add them back to the dataframe at their respective index positions. Concatenating along axis=1 creates NaNs because the indexes do not match. Below is sample output after concatenation.
   lead_lead_source_modified                ClusterLabel_lead_lead_source
0  NaN                                      3.0
1  NaN                                      0.0
2  NaN                                      0.0
3  ['direct', 'salesline', 'website', '']   0.0
I want to know whether this approach is the right way to do it and, if so, how to solve this issue. If not, is there a better way?
Upvotes: 3
Views: 2736
Reputation: 61
Adding the index values when converting the labels to a dataframe solved the issue.
But I still want to know whether this is the right approach.
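For reference, here is a minimal sketch of what adding the index could look like, assuming the fix was to pass `index=sample.index` when building the label dataframe. The tiny `sample` frame, its index values, and `n_clusters=2` are made up for illustration and are not from the original data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the original `sample` frame.
# The index deliberately does not start at 0, as happens after filtering.
sample = pd.DataFrame(
    {
        "lead_lead_source_modified": [
            "direct salesline website",
            "google organic search",
            "direct website",
            "google paid search",
        ]
    },
    index=[10, 11, 12, 13],
)

X = TfidfVectorizer().fit_transform(sample["lead_lead_source_modified"])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Option 1: give the label frame the same index, then concatenate.
# Without index=sample.index the labels would get 0..3 and concat
# would align on index, producing NaNs on both sides.
labels = pd.DataFrame(
    km.labels_, columns=["ClusterLabel_lead_lead_source"], index=sample.index
)
combined = pd.concat([sample, labels], axis=1)

# Option 2: assign the NumPy array directly; a plain array is
# matched to rows by position, so no index alignment takes place.
sample["ClusterLabel_lead_lead_source"] = km.labels_
```

Both options rely on `km.labels_` being in the same row order as the data passed to `fit`, which holds as long as the dataframe is not reordered in between.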
Upvotes: 1