Reputation: 155
I need to generate a random n-dimensional dataset having m tuples. The first four dimensions are expected to be correlated with the ground truth vector y
and the remaining ones are to be arbitrarily generated. I will use the dataset for my regression task using Scikit-learn. How can I generate this data?
For example: a dataset with tuple size m = 10000 and dimension size n = 100.
After that, I need to split the dataset such that randomly selected 70% tuples are used for training while 30% tuples are used for testing.
PS: I have found this code in scikit-learn but I am not sure if I can use it. How can I translate this into my problem?
x, y, coef = datasets.make_regression(n_samples=100,      # number of samples
                                      n_features=1,       # number of features
                                      n_informative=1,    # number of useful features
                                      noise=10,           # standard deviation of the Gaussian noise
                                      coef=True,          # also return the true coefficients used to generate the data
                                      random_state=0)     # set for the same data points on each run
Upvotes: 2
Views: 582
Reputation: 60369
make_regression
will indeed do the job:
from sklearn import datasets
# your example values:
m = 10000
n = 100
random_seed = 42 # for reproducibility, exact value does not matter
X, y = datasets.make_regression(n_samples=m, n_features=n, n_informative=4,
                                shuffle=False,  # keep the 4 informative features as the first 4 columns
                                random_state=random_seed)
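As a quick sanity check, you can pass `coef=True` to also get back the ground-truth coefficients and confirm that, with `shuffle=False`, only the first four features carry signal (the values below use your example sizes; the seed is arbitrary):

```python
import numpy as np
from sklearn import datasets

X, y, coef = datasets.make_regression(
    n_samples=10000, n_features=100, n_informative=4,
    shuffle=False,   # informative features stay in the first columns
    coef=True,       # also return the true coefficient vector
    random_state=42,
)

# Indices of the features with nonzero true coefficients
print(np.nonzero(coef)[0])
```

With `shuffle=True` (the default) the informative columns would instead land at random positions.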
And train_test_split
for splitting:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed) # does not need to be the same random_seed
In both calls, random_state
is set only for reproducibility; its exact value does not matter, nor does it need to be the same in both.
Upvotes: 1