d.a.d.a

Reputation: 1406

Pandas random sample with ratio 1:1 of specific column entry

I have a pandas dataframe with the columns ['text', 'label'], where label is either 'pos' or 'neg'.

The problem is that I have far more rows with the 'neg' label than with 'pos'.

Is there a way to randomly select as many 'neg' sentences as there are 'pos' sentences, so that I get a new dataframe with a 50:50 ratio of both labels?

Do I have to count the 'pos' sentences, put them all in a new dataframe, then do neg_df = dataframe.sample(n=pos_count) on the 'neg' rows and append that to the all-positive dataframe created earlier? Or is there a faster way?
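
For reference, the manual approach I have in mind would look roughly like the sketch below. The toy data and the names pos_df, neg_df and balanced are placeholders for illustration, not my real data, and random_state is only there to make the run repeatable.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
dataframe = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'label': ['pos'] * 2 + ['neg'] * 5,
})

# Keep every 'pos' row and count them.
pos_df = dataframe[dataframe['label'] == 'pos']
pos_count = len(pos_df)

# Draw the same number of 'neg' rows at random, without replacement.
neg_df = dataframe[dataframe['label'] == 'neg'].sample(n=pos_count, random_state=0)

# Stack both parts, then shuffle the rows so the labels are mixed.
balanced = (pd.concat([pos_df, neg_df])
              .sample(frac=1, random_state=0)
              .reset_index(drop=True))
```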

Thanks for your help.

Upvotes: 2

Views: 2108

Answers (1)

Alexander

Reputation: 109546

# Sample data.
import pandas as pd

df = pd.DataFrame({'text': ['a', 'b', 'c', 'd', 'e'],
                   'label': ['pos'] * 2 + ['neg'] * 3})
>>> df
  text label
0    a   pos
1    b   pos
2    c   neg
3    d   neg
4    e   neg

# Select the 'neg' and 'pos' text as separate Series.
neg_text = df.loc[df.label == 'neg', 'text']
pos_text = df.loc[df.label == 'pos', 'text']

# Equally sample 'pos' and 'neg' with replacement and concatenate into a dataframe.
result = pd.concat([neg_text.sample(n=5, replace=True).reset_index(drop=True), 
                    pos_text.sample(n=5, replace=True).reset_index(drop=True)], axis=1)

result.columns = ['neg', 'pos']

>>> result
  neg pos
0   c   b
1   d   a
2   c   b
3   d   a
4   e   a
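
If the goal is instead a 1:1 subset of the original rows with no repeats, one possible sketch is to downsample every label to the size of the smallest class. This assumes pandas 1.1 or later, where groupby(...).sample is available. The names n_min and balanced_df are made up for this example, and random_state is only there to make the run repeatable.

```python
import pandas as pd

df = pd.DataFrame({'text': ['a', 'b', 'c', 'd', 'e'],
                   'label': ['pos'] * 2 + ['neg'] * 3})

# Size of the smallest class ('pos' here, with 2 rows).
n_min = df['label'].value_counts().min()

# Draw n_min rows from each label group, without replacement.
# Requires pandas >= 1.1 for DataFrameGroupBy.sample.
balanced_df = df.groupby('label').sample(n=n_min, random_state=0)
```

The result keeps both columns and the original index, so balanced_df['label'].value_counts() should report the same count for 'pos' and 'neg'.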

Upvotes: 1
