Erick Martinez

Reputation: 53

How to sample a dataframe based on a given distribution where limited classes attenuate the other classes?

Given a target distribution of classes and a dataframe whose rows are examples of those classes, is there a simple, fast way to sample from the dataframe so that the sample matches the target distribution? Classes without enough examples should attenuate (scale down) the number of examples taken from the other classes:

e.g.

+------+-------+-------+
| col1 | col2  | class |
+------+-------+-------+
| 4    | 45    | A     |
+------+-------+-------+
| 5    | 66    | B     |
+------+-------+-------+
| 5    | 6     | C     |
+------+-------+-------+
| 4    | 6     | A     |
+------+-------+-------+
| 321  | 1     | A     |
+------+-------+-------+
| 32   | 432   | A     |
+------+-------+-------+
| 5    | 3     | B     |
+------+-------+-------+

Given a dataframe like the one above and a distribution like the one below:
+-------+--------------+
| class | proportion   |
+-------+--------------+
| A     | 0.50         |
+-------+--------------+
| B     | 0.25         |
+-------+--------------+
| C     | 0.25         |
+-------+--------------+

I would like to return something like:
+------+-------+-------+
| col1 | col2  | class |
+------+-------+-------+
| 5    | 66    | B     |
+------+-------+-------+
| 5    | 6     | C     |
+------+-------+-------+
| 4    | 6     | A     |
+------+-------+-------+
| 32   | 432   | A     |
+------+-------+-------+


Upvotes: 2

Views: 415

Answers (1)

Marat

Reputation: 15738

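df.sample supports per-row weights, so you can weight each row by its class's target proportion divided by the number of rows in that class: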

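import pandas as pd
s = pd.Series({'A': 0.5, 'B': 0.25, 'C': 0.25})  # target class proportions
n = 4  # number of rows to draw (example value, pick your own)
sampled = df.sample(n, weights=df['class'].map(s/df['class'].value_counts()))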
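To see what the weights do, here is a small sketch that rebuilds the dataframe from the question (the variable names `weights` and `class_mass` are just for illustration) and checks that the per-row weights add up to the target proportion within each class. Note that the draw is still random, so the class counts in any single sample vary around the target rather than matching it exactly.

```python
import pandas as pd

# The example dataframe from the question.
df = pd.DataFrame({
    "col1": [4, 5, 5, 4, 321, 32, 5],
    "col2": [45, 66, 6, 6, 1, 432, 3],
    "class": ["A", "B", "C", "A", "A", "A", "B"],
})
s = pd.Series({"A": 0.5, "B": 0.25, "C": 0.25})

# Per-row weight = target proportion of the class / number of rows in the class.
weights = df["class"].map(s / df["class"].value_counts())

# Summed per class, the weights give back the target proportions.
class_mass = weights.groupby(df["class"]).sum()
print(class_mass)
```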

For more information on the topic, search for "label shift".
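If you need the class counts to match the distribution exactly, with the scarcest class capping the total, one possible approach (a different technique from the weighted df.sample above) is to compute a fixed quota per class and sample each class separately. This is only a sketch: `sample_to_distribution` is a hypothetical helper name, and it assumes every proportion is strictly positive and that the proportions sum to 1.

```python
import numpy as np
import pandas as pd

def sample_to_distribution(df, proportions, label="class", random_state=None):
    """Draw the largest sample whose class counts follow `proportions`."""
    # Rows available per class (0 for classes missing from df).
    counts = df[label].value_counts().reindex(proportions.index, fill_value=0)
    # Largest total size such that total * proportion <= available rows for every class.
    total = (counts / proportions).min()
    # Rows to take from each class; the small epsilon guards against float round-off.
    quotas = np.floor(proportions * total + 1e-9).astype(int)
    parts = [
        df[df[label] == cls].sample(n=k, random_state=random_state)
        for cls, k in quotas.items()
    ]
    return pd.concat(parts)

# Usage with the data from the question.
df = pd.DataFrame({
    "col1": [4, 5, 5, 4, 321, 32, 5],
    "col2": [45, 66, 6, 6, 1, 432, 3],
    "class": ["A", "B", "C", "A", "A", "A", "B"],
})
s = pd.Series({"A": 0.5, "B": 0.25, "C": 0.25})
result = sample_to_distribution(df, s, random_state=0)
print(result)
```

With the question's data, C has only one row, so the total is capped at 1 / 0.25 = 4 rows: two from A, one from B, and one from C. Which A and B rows are picked depends on random_state.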

Upvotes: 1
