Reducing the data in a large DataFrame

Question

I'm relatively new to Python and have a question about reducing the data in a large DataFrame. My DataFrame has shape (96350, 156). It has a column named CountryName that contains 160 countries, and each country has about 600 rows.

Input:

df['CountryName'].unique()

Output:

array(['Aruba', 'Afghanistan', 'Angola', 'Albania', 'Andorra',
       'United Arab Emirates', 'Argentina', 'Australia', 'Austria',
       'Azerbaijan', 'Belgium', 'Benin', 'Burkina Faso', 'Bangladesh',
       'Bulgaria', 'Bahrain', 'Bahamas', 'Bosnia and Herzegovina',
...
       'Slovenia', 'Sweden', 'Eswatini', 'Seychelles', 'Chad', 'Togo',
       'Thailand', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Taiwan',
       'Tanzania', 'Uganda', 'Ukraine', 'Uruguay', 'United States',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'South Africa', 'Zambia',
       'Zimbabwe'], dtype=object)

I then split the data with train_test_split, stratifying on that column. Input:

X_train, X_test = train_test_split(df, test_size=.3, stratify=df['CountryName'])

How can I quickly reduce the data for each country, so that each country keeps, say, 60% of its rows?

n1colas.m · Accepted Answer

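You can group by CountryName and call pandas' sample on the groups, passing 0.6 as the frac parameter. This draws 60% of the rows from each country rather than 60% of the DataFrame as a whole.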

print(df.shape) # (96350, 1)

new_df = df.groupby('CountryName').sample(frac=0.6)

print(new_df.shape) # (57812, 1)
print(new_df.shape[0]/df.shape[0]) # 0.600020757654385
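
To see that the reduction happens per country, here is a minimal, self-contained sketch on toy data. The three-country frame and the value column are made up for illustration, and random_state is an optional argument added here only so the draw is repeatable; neither comes from the original question.

```python
import pandas as pd

# Hypothetical toy data: 3 countries with 10 rows each (a stand-in for the real df).
df = pd.DataFrame({
    "CountryName": ["Aruba"] * 10 + ["Chad"] * 10 + ["Togo"] * 10,
    "value": range(30),
})

# Keep 60% of the rows within each country.
# random_state is optional; it only makes the random draw repeatable.
reduced = df.groupby("CountryName").sample(frac=0.6, random_state=0)

# Count how many rows each country kept.
counts = reduced["CountryName"].value_counts().sort_index()
print(counts)
```

Each country keeps 6 of its 10 rows. With real data the per-group counts are rounded, which is why the overall ratio above comes out slightly different from exactly 0.6. The reduced frame can then be passed to train_test_split with stratify=reduced['CountryName'] in the same way as in the question.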
