Reputation: 35
I need to create my own data to develop a classifier and I don't know how.
Upvotes: -1
Views: 139
Reputation: 14072
You can do that very efficiently by using sklearn.datasets.make_classification. It generates a random n-class classification problem, with a lot of options and high flexibility.
Example:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_clusters_per_class=1,
                           n_classes=2, shuffle=True, random_state=2021)
The one-liner above creates a data set with 100 samples, 2 features (both informative), 2 classes, and 1 cluster per class, then shuffles the samples. The random_state argument is just there to make the process reproducible.
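As a quick sanity check on what the call returns, here is a minimal sketch using the same arguments as above. The variable names classes and counts are only illustrative, and the exact per-class counts are not shown because I have not pinned them down for this seed:

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as in the example above
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_clusters_per_class=1,
                           n_classes=2, shuffle=True, random_state=2021)

# X holds one row per sample and one column per feature; y holds the labels
print(X.shape, y.shape)

# Count how many samples ended up in each class
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```

With the default settings the classes should come out roughly balanced, so each label should cover about half of the samples.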
Then you can plot it as:
import matplotlib.pyplot as plt
# Color each point by its class label
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y, s=25, edgecolor='k')
plt.show()
The output is a scatter plot of the 100 samples, colored by class.
Upvotes: 1
Reputation: 3272
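You can just sample from normal distributions with a specific mean and covariance.
import numpy as np
import matplotlib.pyplot as plt
# Covariance matrix: variance 0.5 per feature, no correlation between features
cov = [[0.5, 0], [0, 0.5]]
# First class: two clusters on one diagonal
X1 = np.random.multivariate_normal([2, -2], cov, size=100)
X2 = np.random.multivariate_normal([-2, 2], cov, size=100)
X = np.vstack((X1, X2))
# Second class: two clusters on the other diagonal
Y1 = np.random.multivariate_normal([2, 2], cov, size=100)
Y2 = np.random.multivariate_normal([-2, -2], cov, size=100)
Y = np.vstack((Y1, Y2))
# Each scatter call gets its own color, so the two classes are easy to tell apart
plt.scatter(X[:, 0], X[:, 1])
plt.scatter(Y[:, 0], Y[:, 1])
plt.show()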
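To actually develop a classifier on samples like these, you also need a label array that says which class each row belongs to. Below is a minimal sketch of one way to do that. The names class_a, class_b, features and labels are made up for illustration, the seed is arbitrary, and k-nearest neighbors is just one classifier I picked for the demo, not something the code above prescribes:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
cov = [[0.5, 0.0], [0.0, 0.5]]  # same covariance as in the snippet above

# Two clusters per class, placed on opposite diagonals
class_a = np.vstack((rng.multivariate_normal([2, -2], cov, size=100),
                     rng.multivariate_normal([-2, 2], cov, size=100)))
class_b = np.vstack((rng.multivariate_normal([2, 2], cov, size=100),
                     rng.multivariate_normal([-2, -2], cov, size=100)))

# Stack both classes into one feature matrix and build matching labels (0 and 1)
features = np.vstack((class_a, class_b))
labels = np.concatenate((np.zeros(len(class_a), dtype=int),
                         np.ones(len(class_b), dtype=int)))

# Hold out a quarter of the data for evaluation, keeping the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels)

# Fit the classifier and score it on the held-out part
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Note that this layout is XOR-like, so a single straight line cannot separate the two classes. That is why the sketch uses a non-linear classifier rather than, say, plain logistic regression.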
Upvotes: 0