Reputation: 391
I am trying to get a sense of the rationale between some independent variables and quantify their importance on a dependent variable. I came across methods like the random forest that can quantify the importance of variables and then predict the outcome. However, I have an issue with the nature of the data to be used with the random forest or similar methods. An example of data structure is provided below, and as you can see the time series have some variables like population and Age that do not change with time, though different among the different city. While other variables such as temperature and #internet users are changing through time and within the cities. My question is: how can I quantify the importance of these variables on the “Y” variable? BTW, I prefer to apply the method in python environment.
Upvotes: 0
Views: 196
Reputation: 5355
"How can I quantity the importance" is very common question also known as "feature-importance".
The feature importance depends on your model; with a regression you have importance in your coefficients, in random forest you can use (but, some would not recommend) the build-in feature_importances_
or better the SHAP
-values. Further more you can use som correlaion i.e Spearman/Pearson correlation between your features and your target.
Unfortunately there is no "free lunch", you will need to decide that based on what you want to use it for, how your data looks like etc.
I think the one you came across might be Boruta where you shuffle up your variables, add them to your data set and then create a threshold based on the "best shuffled variable" in a Random Forest.
Upvotes: 1
Reputation: 23
If city and month are also your independent variables, you should convert them from index into columns. Using pandas to read your file, then use df.reset_index() can do the job for you.
Upvotes: 0