Reputation: 1316
I have a dataset that I have run a K-means algorithm on (scikit-learn), and I want to build a decision tree on each cluster. I can recover the feature values for each cluster, but not the matching "class" values. (I'm doing supervised learning: each element belongs to one of two classes, and I need the class associated with each row to build my trees.)
Ex: unfiltered data set:
[val1 val2 class]
X_train=[val1 val2]
y_train=[class]
The clustering code is this:
X = clusterDF[clusterDF.columns[clusterDF.columns.str.contains('\'AB\'')]]
y = clusterDF['Class']
(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.30)
kmeans = KMeans(n_clusters=3, n_init=5, max_iter=3000, random_state=1)
kmeans.fit(X_train, y_train)
y_pred = kmeans.predict(X_test)
And this is my (unbelievably clunky!) code for extracting the values to build the tree. The issue is the Y values: they aren't consistent with the X values.
cl={i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}
for j in range(0,len(k_means_labels_unique)):
    Xc=None
    Y=None
    #for i in range(0,len(k_means_labels_unique)):
    indexes = cl.get(j,0)
    for i, row in X.iterrows():
        if i in indexes:
            if Xc is not None:
                Xc = np.vstack([Xc, [row['first occurrence of \'AB\''],row['similarity to \'AB\'']]])
            else:
                Xc = np.array([row['first occurrence of \'AB\''],row['similarity to \'AB\'']])
            if Y is not None:
                Y = np.vstack([Y, y[i]])
            else:
                Y = np.array(y[i])
    Xc = pd.DataFrame(data=Xc, index=range(0, len(X)),
                      columns=['first occurrence of \'AB\'',
                               'similarity to \'AB\'']) # 1st row as the column names
    Y = pd.DataFrame(data=Y, index=range(0, len(Y)),columns=['Class'])
    print("\n\t-----Classifier ", j + 1,"----")
    (X_train, X_test, y_train, y_test) = train_test_split(X, Y,
                                                          test_size=0.30)
    classifier = DecisionTreeClassifier(criterion='entropy',max_depth = 2)
    classifier = getResults(
        X_train,
        y_train,
        X_test,
        y_test,
        classifier,
        filename='classif'+str(3 + i),
    )
Any ideas for (or outright more efficient ways of) building a decision tree from the clustered data?
Upvotes: 0
Views: 743
Reputation: 4150
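I did not read all of the code, but my guess is that passing an index vector into the train_test_split function would help you keep track of the samples. train_test_split accepts any number of arrays of the same length and applies the same shuffle to all of them, so the returned indices stay aligned with the returned X and y rows.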
X = clusterDF[clusterDF.columns[clusterDF.columns.str.contains('\'AB\'')]]
y = clusterDF['Class']
indices = clusterDF.index
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X, y, indices)
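As a rough sketch of how this could carry through to one tree per cluster: kmeans.labels_ has one entry per row of X_train, in the same order, so a boolean mask on the labels can select the matching rows of X_train, y_train and indices_train together. The synthetic clusterDF below is a made-up stand-in for your data, and the names trees and cluster_rows are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for clusterDF (random data, not from the question).
rng = np.random.default_rng(0)
clusterDF = pd.DataFrame({
    "first occurrence of 'AB'": rng.normal(size=120),
    "similarity to 'AB'": rng.normal(size=120),
    "Class": rng.integers(0, 2, size=120),
})

X = clusterDF[clusterDF.columns[clusterDF.columns.str.contains("'AB'")]]
y = clusterDF['Class']
indices = clusterDF.index

# All three arrays are shuffled and split identically.
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    X, y, indices, test_size=0.30, random_state=1)

# K-means is unsupervised, so only the features are passed to fit().
kmeans = KMeans(n_clusters=3, n_init=5, max_iter=3000, random_state=1)
kmeans.fit(X_train)

trees = {}         # cluster id -> fitted decision tree
cluster_rows = {}  # cluster id -> original clusterDF rows in that cluster
for j in range(kmeans.n_clusters):
    # labels_ is positional with respect to X_train, so one boolean mask
    # selects the same samples from X_train, y_train and indices_train.
    mask = kmeans.labels_ == j
    X_cluster = X_train.loc[mask]
    y_cluster = y_train.loc[mask]
    cluster_rows[j] = clusterDF.loc[indices_train[mask]]

    tree = DecisionTreeClassifier(criterion='entropy', max_depth=2)
    tree.fit(X_cluster, y_cluster)
    trees[j] = tree
```

To evaluate on the held-out data, kmeans.predict(X_test) gives a cluster id per test row, and the same masking idea then picks the test rows to send to each tree.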
Upvotes: 1