Reputation: 615
I have recently got into using SKLearn, especially classification models, and have a question that is more about use cases than about being stuck on a particular bit of code, so apologies in advance if this isn't the right place to ask questions like this.
So far I have been using sample data where one trains the model on data that has already been classified. In the 'Iris' data set, for example, every row is already classified into one of three species. But what if one wants to group/classify the data without knowing the classifications in the first place?
Let's take this imaginary data:
  Name  Feat_1  Feat_2  Feat_3  Feat_4
0    A      12    0.10       0    9734
1    B      76    0.03       1   10024
2    C      97    0.07       1    8188
3    D      32    0.21       1    6420
4    E      45    0.15       0    7723
5    F      61    0.02       1   14987
6    G      25    0.22       0    5290
7    H      49    0.30       0    7107
If one wanted to split the names into 4 separate classes using the different features, is this possible, and which SKLearn model(s) are needed? I'm not asking for any code; I'm quite able to research on my own if someone could point me in the right direction. So far I can only find examples where the classifications are already known.
In the example above, if I wanted to break the data down into 4 classes, I would want my outcome to be something like this (note the new column, denoting the class):
  Name  Feat_1  Feat_2  Feat_3  Feat_4  Class
0    A      12    0.10       0    9734      4
1    B      76    0.03       1   10024      1
2    C      97    0.07       1    8188      3
3    D      32    0.21       1    6420      3
4    E      45    0.15       0    7723      2
5    F      61    0.02       1   14987      1
6    G      25    0.22       0    5290      4
7    H      49    0.30       0    7107      4
Many thanks for any help
Upvotes: 0
Views: 211
Reputation: 6270
This topic is called unsupervised learning.
One definition:
Unsupervised learning is a type of self-organized Hebbian learning that helps find previously unknown patterns in data set without pre-existing labels. It is also known as self-organization and allows modeling probability densities of given inputs.[1] It is one of the main three categories of machine learning, along with supervised and reinforcement learning. Semi-supervised learning has also been described, and is a hybridization of supervised and unsupervised techniques.
There are tons of algorithms out there; you need to try them and see which one fits your data best.
Upvotes: 1
Reputation: 6032
Start with an unsupervised method to determine clusters... use those clusters as your labels.
I recommend using sklearn's GMM instead of k-means.
https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
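K-means implicitly assumes roughly spherical (circular in 2D) clusters of similar size, whereas a Gaussian mixture can also fit elongated, elliptical clusters.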
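As a rough sketch of the GMM suggestion, the snippet below fits sklearn.mixture.GaussianMixture to the eight rows from the question and uses the predicted component as the class. The feature scaling, covariance_type="diag", reg_covar=1e-3 and random_state=0 are my own assumptions, chosen only because eight rows are far too few to estimate full covariance matrices; they are not part of the answer above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Feature values copied from the question's table (rows A..H).
names = ["A", "B", "C", "D", "E", "F", "G", "H"]
X = np.array([
    [12, 0.10, 0, 9734],
    [76, 0.03, 1, 10024],
    [97, 0.07, 1, 8188],
    [32, 0.21, 1, 6420],
    [45, 0.15, 0, 7723],
    [61, 0.02, 1, 14987],
    [25, 0.22, 0, 5290],
    [49, 0.30, 0, 7107],
], dtype=float)

# Assumption: put all features on a comparable scale, otherwise Feat_4
# (values in the thousands) dominates the fit.
X_scaled = StandardScaler().fit_transform(X)

# Assumption: diagonal covariances plus extra regularisation, because the
# sample is tiny. With more data, covariance_type="full" is the usual choice.
gmm = GaussianMixture(
    n_components=4,
    covariance_type="diag",
    reg_covar=1e-3,
    random_state=0,
)
gmm_labels = gmm.fit_predict(X_scaled)

# Soft assignment: probability of each row belonging to each component.
probs = gmm.predict_proba(X_scaled)

for name, label in zip(names, gmm_labels):
    print(name, label)
```

The component numbers are arbitrary ids (0 to 3), so they will not necessarily match the numbering in the question's desired output.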
Upvotes: 1
Reputation: 88
Classification is a supervised approach, meaning that the training data comes with features and labels. If you want to group the data according to the features, then you can go for some clustering algorithms (unsupervised), such as sklearn.cluster.KMeans (with k = 4).
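A minimal sketch of that suggestion on the question's data, in case it helps. The StandardScaler step, n_init=10 and random_state=0 are assumptions on my part (k-means is distance based, so unscaled Feat_4 would otherwise dominate); only KMeans with k = 4 comes from the answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Feature values copied from the question's table (rows A..H).
names = ["A", "B", "C", "D", "E", "F", "G", "H"]
X = np.array([
    [12, 0.10, 0, 9734],
    [76, 0.03, 1, 10024],
    [97, 0.07, 1, 8188],
    [32, 0.21, 1, 6420],
    [45, 0.15, 0, 7723],
    [61, 0.02, 1, 14987],
    [25, 0.22, 0, 5290],
    [49, 0.30, 0, 7107],
], dtype=float)

# Assumption: standardise each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Ask for 4 clusters; fit_predict returns one cluster id per row.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

for name, label in zip(names, labels):
    print(name, label)
```

The returned ids (0 to 3) are arbitrary cluster numbers; they play the role of the "Class" column, but the algorithm attaches no meaning to them.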
Upvotes: 1
Reputation: 5481
You can use agglomerative (hierarchical) clustering, which merges the data into fewer and fewer clusters at each step until everything ends up in one group. (Note that k-means itself does not work this way; it takes the number of clusters up front.) You can either stop merging early, once the number of clusters is the one you want, or go back over the finished merge tree and cut it at the level that gives that number. For example, to get 4 classes, cut the tree at the step where the data are grouped into 4 clusters.
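The merge-until-one-group procedure described above is what scikit-learn and SciPy call agglomerative (hierarchical) clustering. The sketch below shows both variants on the question's data: stopping at 4 clusters with sklearn.cluster.AgglomerativeClustering, and building the full merge tree with SciPy and cutting it afterwards. Ward linkage and the scaling step are my own assumptions, not something the answer specifies.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Feature values copied from the question's table (rows A..H).
names = ["A", "B", "C", "D", "E", "F", "G", "H"]
X = np.array([
    [12, 0.10, 0, 9734],
    [76, 0.03, 1, 10024],
    [97, 0.07, 1, 8188],
    [32, 0.21, 1, 6420],
    [45, 0.15, 0, 7723],
    [61, 0.02, 1, 14987],
    [25, 0.22, 0, 5290],
    [49, 0.30, 0, 7107],
], dtype=float)

# Assumption: standardise features so no single column dominates distances.
X_scaled = StandardScaler().fit_transform(X)

# Variant 1: stop merging once 4 clusters remain.
agg = AgglomerativeClustering(n_clusters=4, linkage="ward")
agg_labels = agg.fit_predict(X_scaled)

# Variant 2: build the whole merge tree once, then cut it afterwards.
# The same tree can be cut again for any other number of clusters.
tree = linkage(X_scaled, method="ward")
cut_labels = fcluster(tree, t=4, criterion="maxclust")  # ids start at 1

for name, a, c in zip(names, agg_labels, cut_labels):
    print(name, a, c)
```

The two variants number their clusters differently (0-based versus 1-based), so compare the groupings rather than the raw ids.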
Upvotes: 1