Reputation: 41
I want to run statistical comparisons between train and test sets, specifically to compare how similar each feature's distribution is across the two. Let's suppose we do this with the two-sample Kolmogorov-Smirnov test. However, I want to first calculate the train-data part of the statistic, save it to disk, and then load only that when new data comes in, so it can be used together with the test data. In other words, I don't want to load the entire train data frame just to run the two-sample distribution similarity test. Is that possible somehow? If not with the KS test, maybe with some other measure, say Kullback-Leibler divergence. Thanks.
Upvotes: 1
Views: 244
Reputation: 20130
Well, this is how I would approach that. I would build an empirical CDF from the train set, store that CDF on disk, and load it whenever it is needed.
Later I would run a sample-vs-CDF (one-sample) K-S test on the test data, say, using a test that accepts a callable CDF as its second parameter.
That callable CDF should be the one you built from the train set.
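The approach above could be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes NumPy and SciPy, uses `scipy.stats.kstest` as the test with a callable `cdf` second parameter, and the helper names `fit_ecdf` and `make_cdf`, the file name `feature_ecdf.npy`, and the synthetic data are all made up for the example. It stores the sorted training values of a single feature, which fully determine the empirical CDF.

```python
import os
import tempfile

import numpy as np
from scipy import stats


def fit_ecdf(train_values):
    """Return the sorted, NaN-free training values; they define the ECDF."""
    values = np.asarray(train_values, dtype=float)
    return np.sort(values[~np.isnan(values)])


def make_cdf(sorted_values):
    """Build a vectorized callable F(x) = fraction of training values <= x."""
    n = sorted_values.size

    def cdf(x):
        return np.searchsorted(sorted_values, x, side="right") / n

    return cdf


# --- Training time: compute the ECDF once and save it (hypothetical path) ---
rng = np.random.default_rng(0)
train = rng.normal(size=10_000)  # stand-in for one feature column
path = os.path.join(tempfile.gettempdir(), "feature_ecdf.npy")
np.save(path, fit_ecdf(train))

# --- Later: load only the stored ECDF, not the train data frame ---
cdf = make_cdf(np.load(path))

test_same = rng.normal(size=500)              # same distribution as train
test_shifted = rng.normal(loc=1.0, size=500)  # shifted distribution

res_same = stats.kstest(test_same, cdf)
res_shifted = stats.kstest(test_shifted, cdf)
print("same:   ", res_same.statistic, res_same.pvalue)
print("shifted:", res_shifted.statistic, res_shifted.pvalue)
```

Two caveats on this sketch. The one-sample test treats the stored CDF as exact, so its p-value ignores sampling error in the train set; that is a reasonable approximation only when the train set is much larger than the test sample. Also, if storing every sorted value is too much, a fixed grid of quantiles could be saved instead, at the cost of a coarser CDF.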
Upvotes: 0