Reputation: 5583
Currently, I'm working with the COVID dataset to do some insights.
The dataset is of this form:
Country Province Lat Lon Date Cases Status
0 Thailand 15.0000 101.0000 2020-01-22 00:00:00+00:00 2 confirmed
1 Thailand 15.0000 101.0000 2020-01-23 00:00:00+00:00 3 confirmed
2 Thailand 15.0000 101.0000 2020-01-24 00:00:00+00:00 5 confirmed
3 Thailand 15.0000 101.0000 2020-01-25 00:00:00+00:00 7 confirmed
4 Thailand 15.0000 101.0000 2020-01-26 00:00:00+00:00 8 confirmed
I want to group by country, summing over the "Cases" column (we'll call this case sum column), but I run into a problem with latitude and longitude: I want to take the lat/long of the max of the case column. In other words, I would like the latitude and longitude from the row with the largest number of cases. To clarify, the use case is that a country like France has rows with multiple latitude and longitudes (like French Polynesia for example) but I just want to take the lat/long in the grouping from the area which has the most cases.
I currently am running an aggregation as follows:
nonzero_cases[(nonzero_cases['Date'] == "03/13/2020")].groupby("Country").agg({"Lat":"first","Lon":"first","Cases":"sum"})
This yields:
Country Lat Lon Cases
Afghanistan 33.0000 65.0000 7
Albania 41.1533 20.1683 33
Algeria 28.0339 1.6596 26
Andorra 42.5063 1.5218 1
...
But this is not quite what I want since it doesn't take into account the case numbers, and just picks the first Lat/Lon.
Upvotes: 4
Views: 1355
Reputation: 16683
A messier answer than jezrael's, but it gets the job done. Unfortunately, groupby, np.where, .loc and of course pd.merge are my go to on 50% of pandas.
nonzero_cases_agg = (nonzero_cases.groupby("Country")
.agg({"Cases":["sum","max"]}).reset_index())
df = pd.merge(nonzero_cases, nonzero_cases_agg, how='left', on='Country')
df = df.loc[df['Cases'] == df[('Cases', 'max')]].copy()
df['Cases'] = df[('Cases', 'sum')]
df = df.drop([('Cases', 'max'), ('Cases', 'sum')], axis=1)
df
Upvotes: 1
Reputation: 863751
Add DataFrame.sort_values
by column Cases
, so now first values are rows with maximum of Cases
per groups:
print (df)
Country Lat Lon Date Cases Status
0 Thailand 15.0 101.0 2020-01-22 00:00:00+00:00 2 confirmed
1 Thailand 15.0 101.0 2020-01-23 00:00:00+00:00 3 confirmed
2 Thailand 15.0 101.0 2020-01-24 00:00:00+00:00 5 confirmed
3 Thailand 15.0 101.0 2020-01-25 00:00:00+00:00 7 confirmed
4 Thailand 14.0 103.0 2020-01-26 00:00:00+00:00 8 confirmed <- changed data
df1 = (df.sort_values('Cases', ascending=False)
.groupby("Country")
.agg({"Lat":"first","Lon":"first","Cases":"sum"}))
print (df1)
Lat Lon Cases
Country
Thailand 14.0 103.0 25
Upvotes: 3