marco_van_cerbex

Reputation: 53

Python: how to use my own dataset with the k-means algorithm

I want to import my own data (sentences which are located in a .txt file) into this example algorithm, which can be found at: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

The problem is that this code uses a make_blobs dataset, and I have a hard time understanding how to replace it with data from a .txt file.

My guess is that I need to replace this piece of code:

X, y = make_blobs(n_samples=500,
          n_features=2,
          centers=4,
          cluster_std=1,
          center_box=(-10.0, 10.0),
          shuffle=True,
          random_state=1)  # For reproducibility

I also do not understand the variables X and y. I assume that X is an array of data, but what is y?

Should I just assign everything to X, as shown below, and expect the example code to work? And what about the make_blobs parameters such as centers and n_features? Do I need to specify them some other way?

# open and read from the txt file
path = "C:/Users/user/Desktop/sentences.txt"
with open(path, 'r') as file:
    # assign it to the X (one string per line of the file)
    X = file.readlines()

Any help is appreciated!

Upvotes: 0

Views: 392

Answers (1)

Eypros

Reputation: 5723

First, you need to map each of your words to a number, because k-means only works on numeric data.

For example:

I ride a bike and I like it.
1 2    3 4    5   1 6    7    # <- number ids
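The mapping above can be sketched in a few lines of Python. This is only an illustration: the helper names `build_vocabulary` and `encode` are hypothetical, and the tokenization (lower-casing and keeping only letters and apostrophes) is an assumption, not something the linked example prescribes.

```python
import re

# Assumption: words are lower-cased and punctuation is dropped.
TOKEN_PATTERN = r"[a-z']+"

def build_vocabulary(sentences):
    """Map each distinct word to an integer id, in order of first appearance."""
    word_to_id = {}
    for sentence in sentences:
        for word in re.findall(TOKEN_PATTERN, sentence.lower()):
            if word not in word_to_id:
                # Ids start at 1 to match the example above.
                word_to_id[word] = len(word_to_id) + 1
    return word_to_id

def encode(sentence, word_to_id):
    """Replace every word of a sentence with its id."""
    return [word_to_id[w] for w in re.findall(TOKEN_PATTERN, sentence.lower())]

vocab = build_vocabulary(["I ride a bike and I like it."])
ids = encode("I ride a bike and I like it.", vocab)
print(ids)  # [1, 2, 3, 4, 5, 1, 6, 7]
```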

After that you have a new embedding for your dataset and you can apply k-means. If you want every sample to have the same shape, convert the ids to a one-hot representation: for each word, create an array of length N, where N is the total number of unique words, with a 1 at the position given by the word's id and 0 everywhere else.

For the example above, with N = 7, this gives:

1 -> 1000000
2 -> 0100000
...
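A minimal NumPy sketch of this encoding follows. The function `one_hot` mirrors the table above. The second function, `sentence_vector`, goes one step beyond what is described here and is an assumption: it sums the one-hot vectors of a sentence so that each sentence becomes a single fixed-length row (a bag-of-words count), which is one common way to get the rectangular array that k-means expects. Both function names are hypothetical.

```python
import numpy as np

def one_hot(word_id, vocab_size):
    """Return a length-N vector with a 1 at the word's position (ids start at 1)."""
    vec = np.zeros(vocab_size, dtype=int)
    vec[word_id - 1] = 1
    return vec

def sentence_vector(word_ids, vocab_size):
    """Assumption: represent a sentence as the sum of its words' one-hot vectors."""
    vec = np.zeros(vocab_size, dtype=int)
    for word_id in word_ids:
        vec += one_hot(word_id, vocab_size)
    return vec

print(one_hot(1, 7))                                     # [1 0 0 0 0 0 0]
print(sentence_vector([1, 2, 3, 4, 5, 1, 6, 7], 7))      # [2 1 1 1 1 1 1]
```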

So now you have an X variable containing your data in a proper format. You don't need y, which holds the corresponding labels for your samples, because k-means is unsupervised and works without labels.

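from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# n_clusters is set by the loop in the linked example
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)
silhouette_avg = silhouette_score(X, cluster_labels)
...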
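Putting the steps together, a rough end-to-end sketch might look like the following. Instead of a hand-written mapping it uses scikit-learn's `CountVectorizer`, which builds the vocabulary and produces one word-count row per sentence; that is a substitution for the manual one-hot step, not the exact procedure described above. The sample sentences, the helper `load_sentences`, and the choice of `n_clusters=2` are placeholders for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score

def load_sentences(path):
    """Hypothetical helper: read one sentence per non-empty line of a text file."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]

# Placeholder data so the sketch runs on its own; with a real file you would
# call load_sentences("sentences.txt") instead.
sentences = [
    "I ride a bike and I like it.",
    "I ride my bike to work.",
    "She likes riding a bike.",
    "The stock market fell today.",
    "Investors sold stock after the market fell.",
    "The market closed lower today.",
]

# Turn each sentence into a fixed-length row of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences).toarray()

# Cluster the rows and score the result, as in the linked example.
n_clusters = 2  # assumption: chosen by hand for this toy data
clusterer = KMeans(n_clusters=n_clusters, random_state=10, n_init=10)
cluster_labels = clusterer.fit_predict(X)
silhouette_avg = silhouette_score(X, cluster_labels)

print(X.shape)
print(cluster_labels)
print(silhouette_avg)
```

The make_blobs arguments such as centers and n_features only describe how the synthetic data is generated, so they have no counterpart here: the number of features is simply the number of columns in X, and the number of clusters is passed to KMeans.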

Upvotes: 1
