Reputation: 67
Suppose I have a DataFrame containing 100 sentences (50 spam, 50 not spam).
Goal: I need to split them into training and testing data with a ratio of 80:20,
which gives 80 training rows (40 spam + 40 not spam) and 20 testing rows (10 spam + 10 not spam).
NB: I'm using pandas, and I need the ratio to be a variable of its own so I can change it.
Where I'm at:
import pandas as pd
df = pd.DataFrame({'sentence': {0: 'FU bro',
1: 'Well thats kinda cool',
2: 'Haha thats so funny',
3: 'cant u make somethin else mtfk',
4: 'what a shame'},
'label': {0: 'spam', 1: 'not spam', 2: 'not spam', 3: 'spam', 4: 'spam'}})
spam = df.loc[df['label']=='spam']
not_spam = df.loc[df['label']=='not spam']
print(spam)
print(not_spam)
#print(df.loc[df['label']=='not spam'].sum)
Here is what the head of my DataFrame looks like:
sentence | label |
---|---|
FU bro | spam |
Well thats kinda cool | not spam |
Haha thats so funny | not spam |
cant u make somethin else mtfk | spam |
what a shame | spam |
Upvotes: 2
Views: 1108
Reputation: 8768
Try this:
train = df.groupby('label').sample(frac=.8)
test = df.loc[df.index.difference(train.index)]
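To see the two lines above working with the ratio held in a variable, here is a minimal sketch. The 100-row frame and the names `train_ratio`, `train_counts`, and `test_counts` are made up for illustration, and `random_state=0` is added only to make the split reproducible. It assumes pandas 1.1 or newer, where `groupby(...).sample()` is available, and it uses `df.drop(train.index)` as an equivalent way to get the remaining rows.

```python
import pandas as pd

# Hypothetical data: 50 spam and 50 not-spam sentences (placeholder text)
df = pd.DataFrame({
    "sentence": [f"sentence {i}" for i in range(100)],
    "label": ["spam"] * 50 + ["not spam"] * 50,
})

train_ratio = 0.8  # change this value to change the split

# Sample the same fraction from every label group
train = df.groupby("label").sample(frac=train_ratio, random_state=0)
# Everything that was not sampled becomes the test set
test = df.drop(train.index)

# Count rows per label in each split
train_counts = train["label"].value_counts().to_dict()
test_counts = test["label"].value_counts().to_dict()
print(train_counts)
print(test_counts)
```

Because the fraction is applied per group, each label keeps the same share in both splits, which is what the 40 + 40 / 10 + 10 requirement asks for.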
Upvotes: 2
Reputation: 19841
You can use DataFrame.sample() for this:
training_data_ratio = 0.8
train_spam = spam.sample(frac=training_data_ratio, random_state=0)
test_spam = spam.drop(train_spam.index)
Do the same for the non-spam data.
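Put together, the per-class approach might look like the sketch below. The 100-row frame is hypothetical, and the names `train_not_spam`, `test_not_spam`, `train`, and `test` are not from the answer; combining the pieces with `pd.concat` is one possible way to finish, not the only one.

```python
import pandas as pd

# Hypothetical data: 50 spam and 50 not-spam sentences (placeholder text)
df = pd.DataFrame({
    "sentence": [f"sentence {i}" for i in range(100)],
    "label": ["spam"] * 50 + ["not spam"] * 50,
})

training_data_ratio = 0.8

spam = df.loc[df["label"] == "spam"]
not_spam = df.loc[df["label"] == "not spam"]

# Split each class separately with the same ratio
train_spam = spam.sample(frac=training_data_ratio, random_state=0)
test_spam = spam.drop(train_spam.index)
train_not_spam = not_spam.sample(frac=training_data_ratio, random_state=0)
test_not_spam = not_spam.drop(train_not_spam.index)

# Recombine the per-class pieces into the final train and test sets
train = pd.concat([train_spam, train_not_spam])
test = pd.concat([test_spam, test_not_spam])
print(len(train), len(test))
```

Note that `sample(frac=...)` rounds the row count per class, so with very small classes (such as the 5-row example in the question) a class can end up with no test rows at all.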
In addition, to check how many entries are spam and how many are not spam, you can use value_counts:
>>> df.label.value_counts()
spam 3
not spam 2
Name: label, dtype: int64
Upvotes: 2