leskovecg

Reputation: 93

Reducing the data in a large DataFrame

I'm relatively new to Python and have a question about reducing the data in a large DataFrame. My DataFrame has shape (96350, 156). It has a column named CountryName that contains 160 countries, and each country has about 600 instances.

Input:

df['CountryName'].unique()

Output:

array(['Aruba', 'Afghanistan', 'Angola', 'Albania', 'Andorra',
       'United Arab Emirates', 'Argentina', 'Australia', 'Austria',
       'Azerbaijan', 'Belgium', 'Benin', 'Burkina Faso', 'Bangladesh',
       'Bulgaria', 'Bahrain', 'Bahamas', 'Bosnia and Herzegovina',
...
       'Slovenia', 'Sweden', 'Eswatini', 'Seychelles', 'Chad', 'Togo',
       'Thailand', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Taiwan',
       'Tanzania', 'Uganda', 'Ukraine', 'Uruguay', 'United States',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'South Africa', 'Zambia',
       'Zimbabwe'], dtype=object)

I then use the following line to split the data with train_test_split. Input:

X_train, X_test = train_test_split(df, test_size=.3, stratify=df['CountryName'])

How can I quickly reduce the data for each country, keeping, say, 60% of each country's instances?

Upvotes: 1

Views: 74

Answers (1)

n1colas.m

Reputation: 3989

You can group by CountryName and use pandas' sample, passing 0.6 (60%) to the frac parameter, so that the rows are drawn separately within each country.

print(df.shape) # (96350, 1)

new_df = df.groupby('CountryName').sample(frac=0.6)

print(new_df.shape) # (57812, 1)
print(new_df.shape[0]/df.shape[0]) # 0.600020757654385

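For a self-contained check of the per-country behaviour, here is a minimal sketch on made-up data. The toy frame, the `value` column, and the `random_state=0` seed are assumptions for illustration only and are not part of the original dataset. It also assumes a pandas version that provides `sample` on groupby objects (1.1 or later).

```python
import pandas as pd

# Hypothetical toy data: 3 countries with 10 rows each (stand-in for the real df).
df = pd.DataFrame({
    "CountryName": ["Aruba"] * 10 + ["Angola"] * 10 + ["Chad"] * 10,
    "value": range(30),
})

# Draw 60% of the rows within each country; random_state makes the draw repeatable.
reduced = df.groupby("CountryName").sample(frac=0.6, random_state=0)

# Count how many rows each country kept.
counts = reduced["CountryName"].value_counts().to_dict()
print(reduced.shape)
print(counts)
```

The reduced frame can then be passed to train_test_split in the same way as before, with stratify=reduced['CountryName'].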
Upvotes: 1
