Reputation: 1440
I have the following toy df:
FilterSystemO2Concentration (Percentage) ProcessChamberHumidityAbsolute (g/m3) ProcessChamberPressure (mbar)
0 0.156 1 29.5 28.4 29.6 28.4
2 0.149 1.3 29.567 28.9
3 0.149 1 29.567 28.9
4 0.148 1.6 29.6 29.4
This is just a sample. The original has over 1200 rows. What's the best way to oversample it while preserving its statistical properties?
I have googled it for some time and have only come across resampling algorithms for imbalanced classes. That's not what I want: I'm not interested in balancing the data, I would just like to produce more samples in a way that more or less preserves the original data distributions and statistical properties.
Thanks in advance
Upvotes: 0
Views: 2308
Reputation: 4343
Using scipy.stats.rv_histogram(np.histogram(data)).isf(np.random.random(size=n))
will create n new samples randomly chosen from the distribution (histogram) of the data. You can do this for each column:
Example:
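import numpy as np
import pandas as pd
import scipy.stats as stats

# Toy data: two independent uniform columns
df = pd.DataFrame({'x': np.random.random(100) * 3, 'y': np.random.random(100) * 4 - 2})
n = 5  # number of new samples to draw
# Per column: build a histogram distribution, then sample it via the inverse survival function
new_values = pd.DataFrame({s: stats.rv_histogram(np.histogram(df[s])).isf(np.random.random(size=n)) for s in df.columns})
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df = pd.concat([df.assign(data_type='original'), new_values.assign(data_type='oversampled')])
df.tail(7)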
>> x y data_type
98 1.176073 -0.207858 original
99 0.734781 -0.223110 original
0 2.014739 -0.369475 oversampled
1 2.825933 -1.122614 oversampled
2 0.155204 1.421869 oversampled
3 1.072144 -1.834163 oversampled
4 1.251650 1.353681 oversampled
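One caveat with sampling each column on its own: the marginal distributions are roughly preserved, but any correlation between columns is lost. If the columns in the real data are related, a joint density estimate may be a better fit. The following is only a minimal sketch, assuming all columns are numeric. It swaps the per-column histogram for scipy.stats.gaussian_kde fitted on all columns at once; the names oversample_kde and toy are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def oversample_kde(df, n, seed=None):
    """Draw n synthetic rows from a Gaussian KDE fitted jointly to all columns.

    Hypothetical helper; assumes every column of df is numeric.
    """
    # gaussian_kde expects an array of shape (n_features, n_rows)
    kde = stats.gaussian_kde(df.to_numpy().T)
    # resample returns an array of shape (n_features, n)
    samples = kde.resample(n, seed=seed)
    return pd.DataFrame(samples.T, columns=df.columns)


# Toy data with two strongly correlated columns
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({'x': x, 'y': 2 * x + rng.normal(scale=0.5, size=500)})

new_rows = oversample_kde(toy, 1000, seed=1)

# Compare the correlation between columns before and after oversampling
print(toy.corr().loc['x', 'y'], new_rows.corr().loc['x', 'y'])
```

Note that a KDE smooths the data, so the synthetic rows will have a somewhat larger variance than the original. If exact copies of existing rows are acceptable, a plain bootstrap with df.sample(n, replace=True) preserves the joint distribution without any smoothing.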
Upvotes: 2