CuriousToKnow

Reputation: 11

Can I set the importance of the features when generating data with make_classification? Which features are intended to be important by make_classification?

I have a question about make_classification from scikit-learn. I have created a dataset with make_classification in order to test how well different models can distinguish important features from less important features.

To do that, I want to configure the features in make_classification accordingly: I would like to know upfront which features are more important and which are less important, and ideally to set or adjust this myself. I have set the following:


from sklearn.datasets import make_classification

X, y = make_classification(n_samples=50000, n_features=10, n_informative=5,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, class_sep=1, flip_y=0.01,
                           weights=[0.9, 0.1], shuffle=True, random_state=42)

The documentation for make_classification mentions the weights and scale parameters, but neither of these seems to be about knowing or shaping the importance of the features.
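For what it's worth, my current understanding from the docs is that with shuffle=False the columns come out in a fixed order (informative features first, then redundant, then repeated, then noise), so the column positions would at least identify which features are informative. A small sketch of that reading:

```python
from sklearn.datasets import make_classification

# My reading of the docs: with shuffle=False the columns are stacked as
# [n_informative | n_redundant | n_repeated | noise], so here columns 0-4
# would be the informative features and columns 5-6 the redundant ones
# (random linear combinations of the informative columns).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=2, n_repeated=0, shuffle=False,
                           random_state=42)
print(X.shape)  # (1000, 10)
```

But even if that tells me *which* features are informative, it does not tell me how important each one is relative to the others, which is the part I am after.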

My question is not about how to determine feature importance when using a specific model or different models.

My questions are:

  1. Can I shape the importance of the features when generating data with make_classification? Which features are intended to be important by make_classification?
  2. Is it possible to set or influence the importance of the variables in make_classification?
  3. Are all informative features equally important, or is there an ordering among them? Can I adjust this in some way?
  4. How do I recognize which are the informative features?

Follow-up question:

  1. Is there another way to generate synthetic data that lets me define the importance of the features, or at least know upfront which features are more important?
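To illustrate what I mean by defining the importance upfront: I could imagine generating the data by hand from a logistic model, where a coefficient vector I choose myself fixes how strongly each feature drives the label (the coefficient values below are just placeholders I made up):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 50000, 10

# Hypothetical setup: the coefficients set each feature's importance;
# feature 0 is the strongest, features 5-9 get weight 0 and are pure noise.
coef = np.array([3.0, 2.0, 1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0])

X = rng.standard_normal((n_samples, n_features))
logits = X @ coef
p = 1.0 / (1.0 + np.exp(-logits))  # P(y=1 | x) under the logistic model
y = rng.binomial(1, p)             # labels sampled from those probabilities
```

That would give me a known ordering of feature importance by construction, but I would prefer something built in, if it exists.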

Thank you, any ideas or advice are highly appreciated.

Upvotes: 0

Views: 27

Answers (0)
