Reputation: 13
I have data from a pollution sensor that I wish to validate, and I am comparing it against data from londonair.org.uk. I have created a simple linear regression model with my sensor data on the X-axis and the Londonair data on the Y-axis, and was able to fit a simple model (of the form y = mx + c). My professor asked me to validate the model using k-fold cross-validation, but I am not sure how.
What I'm unsure about is which dataset to perform the test on. Should it be the raw data taken from the sensor, or the values predicted by the regression model?
Upvotes: 0
Views: 669
Reputation: 5859
Mini-Introduction to K-Fold Cross-Validation
K-fold cross-validation splits the data set into k distinct, equally sized sections, also known as "folds". Each fold in turn is held out as the testing set (also known as the "validation set"), whilst the remaining k - 1 folds become the training set. The model is trained on those k - 1 folds and then evaluated on the held-out fold, where some metric is measured, e.g. accuracy or mean squared error. The process is repeated k times, after which the mean of all k evaluations is taken as the final model evaluation.
To summarize, K-fold cross-validation can be achieved in the following steps:
Randomly shuffle the initial data set and split it into k folds.
For each of the k folds:
(a) Set the current fold aside as the testing data set.
(b) Set the remaining k - 1 folds as the training data set.
(c) Train the model on the training set and use the trained model to evaluate the testing data set.
(d) Record the evaluation metric for this fold.
Calculate the average of the k recorded evaluations to obtain the final score (a code sketch of this procedure follows below).
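Here is a minimal sketch of those steps in Python, assuming scikit-learn is available; `sensor_values` and `londonair_values` are hypothetical placeholder arrays standing in for your paired measurements:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Placeholder data: replace with your real sensor/londonair pairs.
sensor_values = np.random.rand(100, 1)              # X: sensor readings
londonair_values = 2.0 * sensor_values.ravel()      # y: londonair readings

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # step 1: shuffle, split into k folds
fold_scores = []

for train_idx, test_idx in kf.split(sensor_values):   # step 2: iterate over the k folds
    X_train, X_test = sensor_values[train_idx], sensor_values[test_idx]
    y_train, y_test = londonair_values[train_idx], londonair_values[test_idx]

    model = LinearRegression().fit(X_train, y_train)  # (c) train on the k - 1 folds
    y_pred = model.predict(X_test)                    # evaluate on the held-out fold
    fold_scores.append(mean_squared_error(y_test, y_pred))  # (d) record the metric

print("mean MSE across folds:", np.mean(fold_scores))  # step 3: average the k evaluations
```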
You are validating your model, i.e. you are trying to understand how well your model has captured the underlying patterns and relationships in your raw data. The data you use for training is therefore your raw data (training means feeding it into your model so that it can learn), whilst the validation data is data you feed into the model to see how well it has learned from the training data. The basic idea of k-fold cross-validation is never to test the model on data that it has already seen during training.
Your Specific Case
You have data with labels, each instance being a "pair": pollution data -> londonair data. Let's say you have 100 unique pairs. You would feed e.g. 80 such pairs into your model for training (if the raw pollution value is a, the londonair label is b), and use the remaining 20 for validation: you feed the model the pollution data and check that the model returns the correct londonair label corresponding to it (if the raw pollution value is a, what should the label be according to the model?). Repeat the process as described above in the introduction, then average the results; this average reflects your model's accuracy.
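For your case specifically, scikit-learn can do the fold handling in one call. A minimal sketch, again with hypothetical placeholder arrays for your 100 (sensor, londonair) pairs; note that k = 5 gives exactly the 80 training / 20 validation split described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data: replace with your real pairs (a -> b).
sensor_values = np.random.rand(100, 1)                  # a: sensor readings
londonair_values = 2.0 * sensor_values.ravel() + 1.0    # b: londonair labels

# cross_val_score trains and evaluates once per fold; for a regression
# estimator the default score is R^2.
scores = cross_val_score(LinearRegression(), sensor_values, londonair_values,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("R^2 per fold:", scores)
print("mean R^2:", scores.mean())
```

If the mean score stays high and roughly constant across folds, your y = mx + c model generalizes beyond the particular pairs it was fitted on; large variation between folds would suggest the fit depends heavily on which data points it happened to see.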
Upvotes: 3