Reputation: 43
I am helping my girlfriend build a model for her master's thesis project (Env. Sci). The dataset has these columns: Site, Distance (m), Depth (cm), pH, %N, %C, C:N
She measured pH and total Carbon and total Nitrogen from soil/peat samples from 5 different mires (wetlands).
'Distance (m)' is the distance from a non-random starting point (the wet zone); at some of the sites it also goes backwards into negative values. C:N is derived from %N and %C, and Depth is the depth at which the soil sample was taken.
How should we model the data? We suspect there is a relation between all of the variables.
Should the data be grouped by site, with a regression model fitted per site and then compared across sites? Or how do you deal with 'sites' (a categorical variable) alongside numerical values?
Upvotes: 0
Views: 87
Reputation: 362
You can use lots of techniques to deal with a categorical variable like site. One-hot encoding is one of them. Which one works best depends on your data. I highly recommend reading this page to decide on the best option: https://www.datacamp.com/community/tutorials/categorical-data Also, you shouldn't select your features by hand. ("We suspect there is a relation between all of the variables.." -> you don't have to determine yourself which features are the most relevant ones.) There are methods for that. Check this out: https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
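To make the one-hot encoding idea concrete, here is a minimal sketch in Python with pandas and NumPy. The column names (`Distance_m`, `Depth_cm`), the site labels, and all the numbers are made-up placeholders, not your real data, and using pH as the response is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Made-up placeholder values, NOT the real thesis data.
df = pd.DataFrame({
    "Site": ["A", "A", "B", "B", "C", "C"],
    "Distance_m": [-5.0, 10.0, 0.0, 20.0, -10.0, 15.0],
    "Depth_cm": [10.0, 30.0, 10.0, 30.0, 10.0, 30.0],
    "pH": [4.1, 4.6, 5.0, 5.3, 3.9, 4.4],
})

# One-hot encode Site. drop_first=True drops one level (the baseline site)
# so the dummy columns are not collinear with the intercept.
encoded = pd.get_dummies(df, columns=["Site"], drop_first=True, dtype=float)

# Design matrix: intercept + numeric predictors + site dummies.
predictors = encoded.drop(columns=["pH"])
X = np.column_stack([np.ones(len(predictors)), predictors.to_numpy()])
y = encoded["pH"].to_numpy()

# Ordinary least squares fit of pH on all predictors.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
coef_by_name = dict(zip(["intercept", *predictors.columns], coef))
print(coef_by_name)
```

Each `Site_*` coefficient is then that site's offset from the baseline site, so all sites are fitted in one model instead of one regression per site. Since C:N is derived from %N and %C, you would probably not want all three as predictors in the same model.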
Upvotes: 1