Anggia Bagaskara

Reputation: 67

Split a pandas DataFrame by label with a given ratio

Suppose I have a dataframe containing 100 sentences (50 spam, 50 not spam).
Goal: I need to split them into training and testing data with a ratio of 80:20,
which gives 80 training rows (40 spam + 40 not spam) and 20 testing rows (10 spam + 10 not spam).

NB: I'm using pandas, and I need the ratio to be a variable of its own so I can change it.

Where I'm at:

import pandas as pd

df = pd.DataFrame({'sentence': {0: 'FU bro',
  1: 'Well thats kinda cool',
  2: 'Haha thats so funny',
  3: 'cant u make somethin else mtfk',
  4: 'what a shame'},
 'label': {0: 'spam', 1: 'not spam', 2: 'not spam', 3: 'spam', 4: 'spam'}})

spam = df.loc[df['label']=='spam']
not_spam = df.loc[df['label']=='not spam']

print(spam)
print(not_spam)
#print(df.loc[df['label']=='not spam'].sum)

Here is what the head of my dataframe looks like:

sentence                          label
FU bro                            spam
Well thats kinda cool             not spam
Haha thats so funny               not spam
cant u make somethin else mtfk    spam
what a shame                      spam

Upvotes: 2

Views: 1108

Answers (2)

rhug123

Reputation: 8768

Try this:

train = df.groupby('label').sample(frac=.8)
test = df.loc[df.index.difference(train.index)]
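
To make the ratio a variable, as the question asks, the same idea can be sketched like this. The 100-row dataframe, the name train_ratio, and the random_state value are assumptions for illustration only; groupby(...).sample() needs pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical balanced dataset: 50 spam + 50 not spam, as described in the question.
df = pd.DataFrame({
    "sentence": [f"sentence {i}" for i in range(100)],
    "label": ["spam"] * 50 + ["not spam"] * 50,
})

train_ratio = 0.8  # change this value to adjust the split

# Sample train_ratio of the rows within each label group (pandas >= 1.1).
# random_state is only there to make the split reproducible.
train = df.groupby("label").sample(frac=train_ratio, random_state=0)

# Every row that was not sampled into train goes to test.
test = df.drop(train.index)

print(train["label"].value_counts())
print(test["label"].value_counts())
```

Note that df.drop(train.index) (like the index.difference version above) relies on the dataframe having a unique index.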

Upvotes: 2

AKS

Reputation: 19841

You can use DataFrame.sample() for this:

training_data_ratio = 0.8
train_spam = spam.sample(frac=training_data_ratio, random_state=0)
test_spam = spam.drop(train_spam.index)

And, similarly for the non spam data.
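
Put together for both labels, this could look roughly like the sketch below. The helper name split_by_label, its parameters, and the 100-row sample data are hypothetical; they are not part of pandas.

```python
import pandas as pd


def split_by_label(df, label_col, train_ratio, seed=0):
    """Apply the sample()/drop() split to each label group, then recombine."""
    train_parts, test_parts = [], []
    for _, group in df.groupby(label_col):
        # Take train_ratio of this label's rows for training.
        train_part = group.sample(frac=train_ratio, random_state=seed)
        train_parts.append(train_part)
        # The remaining rows of this label go to testing.
        test_parts.append(group.drop(train_part.index))
    return pd.concat(train_parts), pd.concat(test_parts)


# Hypothetical balanced dataset: 50 spam + 50 not spam.
df = pd.DataFrame({
    "sentence": [f"sentence {i}" for i in range(100)],
    "label": ["spam"] * 50 + ["not spam"] * 50,
})

training_data_ratio = 0.8
train, test = split_by_label(df, "label", training_data_ratio)

print(len(train), len(test))
```

Looping over groupby means the same code works unchanged if more labels are added later.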


In addition, if you need to check how many entries are spam and not spam, you can use value_counts:

>>> df.label.value_counts()
spam        3
not spam    2
Name: label, dtype: int64

Upvotes: 2
