Reputation: 93
I'm a relatively new Python user and have a question about reducing the size of a large dataframe. My dataframe has shape (96350, 156). It contains a column named CountryName with 160 countries, each with about 600 rows.
Input:
df['CountryName'].unique()
Output:
array(['Aruba', 'Afghanistan', 'Angola', 'Albania', 'Andorra',
'United Arab Emirates', 'Argentina', 'Australia', 'Austria',
'Azerbaijan', 'Belgium', 'Benin', 'Burkina Faso', 'Bangladesh',
'Bulgaria', 'Bahrain', 'Bahamas', 'Bosnia and Herzegovina',
...
'Slovenia', 'Sweden', 'Eswatini', 'Seychelles', 'Chad', 'Togo',
'Thailand', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Taiwan',
'Tanzania', 'Uganda', 'Ukraine', 'Uruguay', 'United States',
'Uzbekistan', 'Venezuela', 'Vietnam', 'South Africa', 'Zambia',
'Zimbabwe'], dtype=object)
Then I use the following line to split the data with train_test_split.
Input:
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, test_size=0.3, stratify=df['CountryName'])
Do you know how I can quickly reduce the data for each country? Let's say each country should keep about 60% of its rows.
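A sketch of one way this kind of per-group downsampling is commonly done, using `DataFrame.groupby(...).sample(frac=...)` (available in pandas >= 1.1). The dataframe below is a small stand-in for the real data, not the asker's actual frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: 3 countries, 10 rows each
df = pd.DataFrame({
    'CountryName': np.repeat(['Aruba', 'Chad', 'Togo'], 10),
    'value': np.arange(30),
})

# Keep a random 60% of the rows within each country;
# group_keys=False keeps the original flat index layout
reduced = df.groupby('CountryName', group_keys=False).sample(frac=0.6, random_state=0)

print(reduced['CountryName'].value_counts())
```

Each country keeps 6 of its 10 rows here; on the real frame, `frac=0.6` would keep roughly 360 of each country's ~600 rows. Note that if the goal is only a stratified 60/40 split, `train_test_split(df, test_size=0.4, stratify=df['CountryName'])` already leaves 60% of each country in the training set.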
Upvotes: 1
Views: 74