Reputation: 53
Given a target distribution over classes and a dataframe whose rows are examples of those classes, is there a simple, fast way to sample rows from the dataframe so that the sample matches the target distribution? If a class does not have enough examples, the number of rows taken from the other classes should be reduced so that the proportions still hold.
For example:
+------+-------+-------+
| col1 | col2  | class |
+------+-------+-------+
| 4    | 45    | A     |
+------+-------+-------+
| 5    | 66    | B     |
+------+-------+-------+
| 5    | 6     | C     |
+------+-------+-------+
| 4    | 6     | A     |
+------+-------+-------+
| 321  | 1     | A     |
+------+-------+-------+
| 32   | 432   | A     |
+------+-------+-------+
| 5    | 3     | B     |
+------+-------+-------+
Given a dataframe like the one above and a distribution like the one below:
+-------+--------------+
| class | proportion   |
+-------+--------------+
| A     | 0.50         |
+-------+--------------+
| B     | 0.25         |
+-------+--------------+
| C     | 0.25         |
+-------+--------------+
I would like to return something like:
+------+-------+-------+
| col1 | col2  | class |
+------+-------+-------+
| 5    | 66    | B     |
+------+-------+-------+
| 5    | 6     | C     |
+------+-------+-------+
| 4    | 6     | A     |
+------+-------+-------+
| 32   | 432   | A     |
+------+-------+-------+
Upvotes: 2
Views: 415
Reputation: 15738
df.sample supports weighting rows:
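import pandas as pd
s = pd.Series({'A': 0.5, 'B': 0.25, 'C': 0.25})  # target class proportions
# n is the desired sample size; each row's weight is its class proportion / that class's row count
df.sample(n, weights=df['class'].map(s / df['class'].value_counts()))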
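As a runnable illustration of the snippet above, here is a minimal sketch that rebuilds the question's example frame and applies the same weighting. The names target and row_weights, the sample size of 4, and random_state=0 are assumptions made only for this example. Because the draw is random, the class mix of the result matches the target only in expectation, not exactly.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "col1": [4, 5, 5, 4, 321, 32, 5],
    "col2": [45, 66, 6, 6, 1, 432, 3],
    "class": ["A", "B", "C", "A", "A", "A", "B"],
})
target = pd.Series({"A": 0.5, "B": 0.25, "C": 0.25})

# Per-row weight = target proportion of the row's class / number of rows in that class,
# so each class's rows together carry that class's target proportion.
row_weights = df["class"].map(target / df["class"].value_counts())

# n=4 and random_state=0 are arbitrary choices for illustration.
sample = df.sample(n=4, weights=row_weights, random_state=0)
print(sample)
```

Rare classes get a larger weight per row (here the single C row gets 0.25, while each of the four A rows gets 0.125), which is what pulls the sample toward the target mix.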
For more on this topic, search for "label shift".
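The weighted draw matches the target proportions only on average. If exact per-class counts are needed, with the scarcest class capping the total as in the question's expected output, one possible approach is to compute per-class quotas first and then sample each class separately. The sketch below is an assumption-laden illustration, not part of the original answer: sample_to_distribution is a hypothetical helper, and it assumes every target proportion is strictly positive.

```python
import numpy as np
import pandas as pd

def sample_to_distribution(df, proportions, label="class", random_state=None):
    """Hypothetical helper: largest sample whose class mix matches `proportions`."""
    # Rows available per class; classes missing from df count as 0.
    counts = df[label].value_counts().reindex(proportions.index, fill_value=0)
    # Largest total size for which no class needs more rows than it has.
    total = int(np.floor((counts / proportions).min()))
    # Rows to take per class; rounding down may leave the sample slightly short.
    quotas = np.floor(total * proportions).astype(int)
    parts = [
        df[df[label] == cls].sample(n=int(k), random_state=random_state)
        for cls, k in quotas.items()
    ]
    return pd.concat(parts)

# Example data copied from the question.
df = pd.DataFrame({
    "col1": [4, 5, 5, 4, 321, 32, 5],
    "col2": [45, 66, 6, 6, 1, 432, 3],
    "class": ["A", "B", "C", "A", "A", "A", "B"],
})
proportions = pd.Series({"A": 0.5, "B": 0.25, "C": 0.25})

result = sample_to_distribution(df, proportions, random_state=0)
print(result)
```

On the question's data, class C has only one row, so the total is capped at 1 / 0.25 = 4 and the result contains two A rows, one B row, and one C row. Which A and B rows are picked depends on random_state.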
Upvotes: 1