Reputation: 1277
General info about my dataset: I have 40k data points and 5 features. I'm doing regression and trying to build a model that can predict the error of a GPS.
For example, imagine that your vehicle's GPS is making an error of 10 meters and you want to correct it. So I took along a second, very accurate GPS and recorded 40k data points while driving. My dataset contains vehicle information, namely speed, acceleration, yaw rate, timestamp and wheel angle, and position information: the ground-truth longitudes and latitudes plus the inaccurate longitudes and latitudes from my normal GPS.
I'm transforming those latitudes and longitudes into an x and y just to know how much I should shift my inaccurate longitudes and latitudes so that my position becomes more accurate and closer to the ground-truth values. Can my data be bad in this case? I'm trying to predict the error in longitude and latitude that the GPS makes so that I can correct it later. So it's a regression problem, and I'm using the features above, which I think are informative since speed, acceleration, yaw rate and wheel angle are somehow related to position (am I wrong?).
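In case it matters, here is roughly how I do the transformation (a simplified sketch; the column names are just placeholders for my actual data, and I use an equirectangular approximation since the errors are only a few meters):

```python
import numpy as np
import pandas as pd

R_EARTH = 6_371_000.0  # mean Earth radius in meters

def latlon_error_to_xy(df: pd.DataFrame) -> pd.DataFrame:
    """Turn the lat/lon error into a local shift in meters.
    Assumes columns lat_gt/lon_gt (accurate GPS) and lat_gps/lon_gps (normal GPS)."""
    lat0 = np.deg2rad(df["lat_gt"])
    dlat = np.deg2rad(df["lat_gt"] - df["lat_gps"])
    dlon = np.deg2rad(df["lon_gt"] - df["lon_gps"])
    out = df.copy()
    out["ydistance"] = dlat * R_EARTH                 # north-south error in meters
    out["xdistance"] = dlon * R_EARTH * np.cos(lat0)  # east-west error in meters
    return out
```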
I'm asking this generally: I've read some articles on the internet saying that data is sometimes bad, or that the quality of the data is bad, but I don't know what these mysterious sentences really mean.
I also had a problem when training neural networks: my loss decreases for the first 10-20 epochs and then gets stuck at some high value, and the network stops learning, as if it were struggling to escape that loss value but couldn't. I tried using only 100 data points instead of all 40k and noticed that it worked well; the NN managed to fit those, but as I increase the number of data points the performance gets worse (do you have any ideas about this?).
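To make the experiment concrete, this is roughly what I do (a simplified sketch with scikit-learn; my real network is different, and X/y stand for my 40k x 5 feature matrix and the two error targets):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def fit_subset(X, y, n_points, seed=0):
    """Train a small MLP on a random subset and return its training loss curve."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_points, replace=False)
    Xs = StandardScaler().fit_transform(X[idx])  # scale the 5 features
    ys = StandardScaler().fit_transform(y[idx])  # scale the targets too (they are in meters)
    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=seed)
    net.fit(Xs, ys)
    return net.loss_curve_

loss_small = fit_subset(X, y, 100)     # here the loss goes almost to zero
loss_full  = fit_subset(X, y, len(X))  # here the loss plateaus at a high value
```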
Some people suggested that I don't have much data or many features, and that in such a case it would be better to use a classical machine learning approach, since those tend to outperform NNs on small datasets or with few features like mine. So I also tried a random forest and noticed that it gives better results than the neural networks, but it doesn't generalize well either: even though it gave me good results on the train and validation sets, when I try it on test data (data the random forest has never seen), it performs really badly.
So I was reading on the internet about what can cause these problems, and I kept seeing people or articles claiming that maybe the quality of the data is bad! But what does that really mean? I thought neural networks can map any kind of data: if I have one feature and one target, then a neural network can map the two together, or at least overfit the data, right?
How can we define what bad data is, or better yet, how do I know if my data is bad? If there were a way to know, I could save time instead of working on a project for a month only to find out in the end that my data is bad. Also, does my case make sense? I find it weird that NNs give much worse performance than a random forest. At the very least my NN should be able to overfit the data, or am I wrong?
So I'll post a heatmap and a pair plot of my data; maybe this makes it easier to see what my data looks like. The correlation with the targets is not good, and that's why I think my data may be uninformative for this task.
This is the pair plot; xdistance and ydistance are the targets I want to predict:
Heatmap:
Upvotes: 1
Views: 428
Reputation: 529
Take into account that the vehicle's GPS device knows your speed. That means it knows your position (and time) at at least a few previous points, and the information shown to the driver may not be the raw reading from the GPS sensor but the result of some (better or worse) calculations. Those calculations may handle speed with varying quality, and some of them may not take into account the angle of a gentle bend, etc. So your problem is not to compare raw readings from a more and a less precise GPS sensor, but rather to compare two algorithms. That is why speed and acceleration, which seem like obviously good features, fail.
EDIT2:
Possible source of target leakage.
Every electronic device has a drift. Imagine the readings of two GPS units in a non-moving car. Neither is stable; both change slowly over time. We can calculate a long-term (several days) bias between the two readings. Let's say that, because of this bias, the correction of the worse GPS relative to the better one is 2 m East and 1 m North. However, during the relatively short time covered by training/validation this difference may drift from 3 m East, 1 m North to 2 m East, 2 m North (mean 2.5 m East, 1.5 m North), while during testing it drifts from 2 m East, 0 m North to 1 m East, 1 m North (mean 1.5 m East, 0.5 m North). So even simply using the mean target value calculated on the training set gives an error of at most 0.5 m in location on the validation data, while using that same mean on the data recorded during testing gives an error three times higher (1.5 m).
So, because of the described drift of both GPS units, the validation dataset should be taken from a quite different range of timestamps than the training dataset if we want to avoid target leakage and use this model to predict the test dataset.
Because testing was done on the same road as training/validation, the described drift effect can easily be verified: just calculate the mean target for the training, validation and test datasets. If abs(meanTestTarget - meanTrainTarget) differs significantly from abs(meanValidTarget - meanTrainTarget), that may point to target leakage as the explanation. A sketch of this check is below.
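In code, the check could look roughly like this (a sketch, assuming the arrays are already sorted by timestamp and split chronologically 60/20/20; adapt it to your actual splits):

```python
import numpy as np

# y: (n, 2) array with the targets (xdistance, ydistance) in meters, sorted by timestamp
n = len(y)
y_train = y[: int(0.6 * n)]
y_valid = y[int(0.6 * n): int(0.8 * n)]
y_test  = y[int(0.8 * n):]

mean_train = y_train.mean(axis=0)
mean_valid = y_valid.mean(axis=0)
mean_test  = y_test.mean(axis=0)

print("abs(meanValidTarget - meanTrainTarget):", np.abs(mean_valid - mean_train))
print("abs(meanTestTarget  - meanTrainTarget):", np.abs(mean_test  - mean_train))
# If the second difference is much larger than the first, the drift described
# above is a plausible explanation for the gap between validation and test performance.
```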
Upvotes: 1
Reputation: 1163
The only way to know whether you have bad data is to do "data exploration": find out whether there are strong correlations between features, check for missing values or many outliers, and plot your quantities.
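A first pass could look roughly like this (a sketch, assuming a DataFrame df holding your 5 features plus the two targets, all numeric):

```python
import pandas as pd

print(df.isna().sum())   # missing values per column
print(df.describe())     # ranges and means; spot obviously broken values
print(df.corr())         # feature/feature and feature/target correlations

# Count outliers per column with the simple IQR rule
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
print(((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum())
```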
The problem you describe is usually solved with Kalman filters (check out sensor fusion). It sounds like a solvable problem, and 40k data points is certainly not too small for a neural network.
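To illustrate the idea (a bare-bones constant-velocity Kalman filter on positions only; the noise values q and r are placeholders, and a real sensor-fusion setup would also feed in speed and yaw rate through a proper motion model):

```python
import numpy as np

def kalman_track(zs, dt=1.0, q=0.1, r=5.0):
    """zs: (n, 2) noisy GPS positions in local meters; returns filtered positions."""
    F = np.array([[1, 0, dt, 0],     # state: [x, y, vx, vy], constant-velocity model
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    H = np.array([[1, 0, 0, 0],      # we only measure position
                  [0, 1, 0, 0]], dtype=float)
    Q = q * np.eye(4)                # process noise (assumed)
    R = r * np.eye(2)                # measurement noise (assumed)
    x = np.array([zs[0, 0], zs[0, 1], 0.0, 0.0])
    P = np.eye(4) * 10.0
    filtered = []
    for z in zs:
        x = F @ x                    # predict
        P = F @ P @ F.T + Q
        y = z - H @ x                # update with the GPS measurement
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(4) - K @ H) @ P
        filtered.append(x[:2].copy())
    return np.array(filtered)
```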
Maybe you are just doing something wrong with data normalization or you have a bad network architecture.
Try adding plots of the training and test loss for both the small and the big dataset to the question; that could be helpful.
It's hard to say more without seeing the actual data and some code.
Upvotes: 1