Reputation: 53
I want to import my own data (sentences which are located in a .txt file) into this example algorithm, which can be found at: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
The problem is that this code uses a make_blobs dataset, and I have a hard time understanding how to replace it with data from a .txt file.
All I can figure out is that I need to replace this piece of code:
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)  # For reproducibility
Also, I do not understand the variables X and y. I assume that X is an array of data, but what about y?
Should I just assign everything to X like this, so that the example code would work? And what about the make_blobs parameters like centers, n_features, etc.? Do I need to specify them somehow differently?
# open and read from the txt file
path = "C:/Users/user/Desktop/sentences.txt"
file = open(path, 'r')
# assign it to the X
X = file.readlines()
Any help is appreciated!
Upvotes: 0
Views: 392
Reputation: 5723
Firstly, you need to create a mapping from your words to numbers that the k-means algorithm can use.
For example:
I ride a bike and I like it.
1 2 3 4 5 1 6 7 # <- number ids
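For illustration, a minimal sketch of building such a mapping in Python (the names sentence, word_to_id and ids are hypothetical, not from the original example):

sentence = "I ride a bike and I like it".split()

# assign each new word the next free id, starting at 1 as in the example above
word_to_id = {}
for word in sentence:
    if word not in word_to_id:
        word_to_id[word] = len(word_to_id) + 1

ids = [word_to_id[word] for word in sentence]
print(ids)  # [1, 2, 3, 4, 5, 1, 6, 7]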
After that you have a new embedding for your dataset and you can apply k-means to it. If you want a homogeneous representation for your samples, you must convert them to a one-hot representation: for each sample you create an array of length N, where N is the total number of unique words you have, with a 1 at the position corresponding to the word's id and 0 everywhere else.
An example of the above for N = 7 would be:
1 -> 1000000
2 -> 0100000
...
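Sticking with the hypothetical ids and word_to_id from the previous sketch, the one-hot conversion could look like this (NumPy is assumed):

import numpy as np

N = len(word_to_id)                           # number of unique words, 7 here
one_hot = np.zeros((len(ids), N), dtype=int)  # one row per word occurrence
for row, word_id in enumerate(ids):
    one_hot[row, word_id - 1] = 1             # ids start at 1, columns start at 0

print(one_hot[0])  # [1 0 0 0 0 0 0]  word id 1
print(one_hot[1])  # [0 1 0 0 0 0 0]  word id 2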
So now you can have an X variable containing your data in a proper format. You don't need y, which holds the corresponding labels for your samples.
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)
silhouette_avg = silhouette_score(X, cluster_labels)
...
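Putting it together with the asker's file, a rough end-to-end sketch might look like the following. It assumes each line of sentences.txt is one sentence and represents a sentence by summing the one-hot vectors of its words (i.e. bag-of-words counts); n_clusters is fixed here, whereas the linked example loops over several values:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

path = "C:/Users/user/Desktop/sentences.txt"
with open(path, 'r') as f:
    sentences = [line.split() for line in f if line.strip()]

# build the word-to-id mapping over the whole file
word_to_id = {}
for words in sentences:
    for word in words:
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)

# one row per sentence, one column per unique word
X = np.zeros((len(sentences), len(word_to_id)))
for row, words in enumerate(sentences):
    for word in words:
        X[row, word_to_id[word]] += 1

n_clusters = 4  # illustrative; the linked example iterates over range_n_clusters
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)
silhouette_avg = silhouette_score(X, cluster_labels)
print("For n_clusters =", n_clusters, "the average silhouette score is", silhouette_avg)

Note that the silhouette computation works for any number of features, but the second subplot in the linked example scatters the first two feature columns, which only makes visual sense when the data has two features.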
Upvotes: 1