Reputation: 155
I need to generate a random n-dimensional dataset having m tuples. The first four dimensions are expected to be correlated with the ground truth vector y
and the remaining ones are to be arbitrarily generated. I will use the dataset for my regression task using Scikit-learn. How can I generate this data?
For example: a dataset with tuple size m = 10000 and dimension size n = 100.
After that, I need to split the dataset such that randomly selected 70% tuples are used for training while 30% tuples are used for testing.
PS: I have found this code in scikit-learn but I am not sure if I can use it. How can I translate this into my problem?
x, y, coef = datasets.make_regression(n_samples=100,      # number of samples
                                      n_features=1,       # number of features
                                      n_informative=1,    # number of useful features
                                      noise=10,           # standard deviation of the Gaussian noise
                                      coef=True,          # also return the true coefficients used to generate the data
                                      random_state=0)     # set for the same data points on each run
Upvotes: 2
Views: 582
Reputation: 60369
make_regression
will indeed do the job:
from sklearn import datasets
# your example values:
m = 10000
n = 100
random_seed = 42 # for reproducibility, exact value does not matter
X, y = datasets.make_regression(n_samples=m, n_features=n, n_informative=4,
                                shuffle=False,  # keep the 4 informative features as the first 4 columns
                                random_state=random_seed)
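As a quick sanity check, you can pass `coef=True` to also get back the ground-truth coefficients and confirm that, with `shuffle=False`, only the first four features carry signal (the values below use your example sizes; the seed is arbitrary):

```python
import numpy as np
from sklearn import datasets

X, y, coef = datasets.make_regression(
    n_samples=10000, n_features=100, n_informative=4,
    shuffle=False,   # informative features stay in the first columns
    coef=True,       # also return the true coefficient vector
    random_state=42,
)

# Indices of the features with nonzero true coefficients
print(np.nonzero(coef)[0])
```

With `shuffle=True` (the default) the informative columns would instead land at random positions.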
And train_test_split
for splitting:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed) # does not need to be the same random_seed
In both calls, random_state
is set only for reproducibility; its exact value does not matter, nor does it need to be the same in both.
Upvotes: 1