Martin Moltke Wozniak

Reputation: 43

How to make a statistical model with data from different locations (categorical variables)?

I am helping my girlfriend build a model for her master's thesis project (Env. Sci.). The dataset has these columns: Site, Distance (m), Depth (cm), pH, %N, %C, C:N.

She measured pH, total carbon, and total nitrogen in soil/peat samples from 5 different mires (wetlands).

'Distance (m)' is the distance from a non-random starting point (the wet zone); at some of the sites it also runs backwards into negative values. C:N is derived from %N and %C, and Depth is the depth at which the soil sample was taken.

How should we model the data? We suspect there is a relation between all of the variables.

Should the data be grouped by site, with a regression model fitted per site and then compared against the other sites? Or how do you deal with 'sites' (categorical variables) alongside numerical values?

Upvotes: 0

Views: 87

Answers (1)

Zerzavot

Reputation: 362

You can use many techniques to deal with this problem. One-hot encoding is one of them; which one fits best depends on your data. I highly recommend reading this page to decide on the best option: https://www.datacamp.com/community/tutorials/categorical-data

Also, you shouldn't select your features by hand. ("We suspect there is a relation between all of the variables": you don't have to determine yourself which features are the most relevant ones.) There are methods for that. Check these out:

https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/

https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2
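To make the one-hot encoding idea concrete, here is a minimal sketch in plain Python. The site labels, the column names (`Distance_m`, `Depth_cm`) and all the values are made up for illustration and are not from the actual dataset, and `one_hot_sites` is a hypothetical helper, not a library function. It turns the `Site` column into 0/1 indicator columns and drops one site as the baseline, which is the usual way to avoid redundant columns when the regression has an intercept.

```python
def one_hot_sites(rows, site_key="Site"):
    """Replace the categorical site column with 0/1 indicator columns.

    The alphabetically first site is dropped and acts as the baseline,
    so the indicators are not perfectly collinear with an intercept.
    """
    levels = sorted({row[site_key] for row in rows})
    baseline, others = levels[0], levels[1:]
    encoded = []
    for row in rows:
        # Keep every numeric column, drop the original categorical one.
        new_row = {k: v for k, v in row.items() if k != site_key}
        for level in others:
            new_row[f"{site_key}_{level}"] = 1 if row[site_key] == level else 0
        encoded.append(new_row)
    return encoded, baseline


# Hypothetical measurements, invented purely for illustration.
samples = [
    {"Site": "A", "Distance_m": -2.0, "Depth_cm": 10, "pH": 4.1},
    {"Site": "B", "Distance_m": 5.0, "Depth_cm": 10, "pH": 4.8},
    {"Site": "C", "Distance_m": 12.0, "Depth_cm": 30, "pH": 5.3},
]

encoded, baseline = one_hot_sites(samples)
for row in encoded:
    print(row)
```

The encoded rows can then go into a single regression over all sites, where each indicator's coefficient is read as that site's offset relative to the baseline site. With real data you would more likely let a library such as pandas or a statistics package do the encoding for you.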

Upvotes: 1
