Reputation: 152
I have a rather simple problem to which I cannot seem to find any solution. I have looked at many similar questions online but their solutions don't work in my case.
I have a pandas dataset in the following format
CountryCode ConfirmedCases ConfirmedDeaths date
335 ABW 4872.0 45.0 2020-12-01
336 ABW 4902.0 45.0 2020-12-02
337 ABW 4923.0 45.0 2020-12-03
338 ABW 4942.0 45.0 2020-12-04
339 ABW 4957.0 45.0 2020-12-05
ConfirmedCases and ConfirmedDeaths are both measured as cumulative sums by CountryCode, but I need the daily cases and daily deaths by CountryCode. This requires that I compute the difference between confirmed cases at time t and time t-1 for every country in my dataset--i.e. the difference between the number of cases today and the day before in the same country. Ideally, I would like a solution that allows me to register the daily number of cases and daily number of deaths by country as two new variables in the dataset. As a result, I would like something that should look like the following:
CountryCode ConfirmedCases ConfirmedDeaths date DailyCases Daily Deaths
335 ABW 4872.0 45.0 2020-12-01 15 0
336 ABW 4902.0 45.0 2020-12-02 30 0
337 ABW 4923.0 45.0 2020-12-03 21 0
338 ABW 4942.0 45.0 2020-12-04 19 0
339 ABW 4957.0 45.0 2020-12-05 15 0
Thanks in advance for your help!
Upvotes: 1
Views: 390
Reputation: 3276
Quang Hoang's solution is right except you should then replace NaN
values with the initial value for each country:
df['DailyCases'] = df.groupby('CountryCode')['ConfirmedCases'].diff()
df['DailyCases'] = df['DailyCases'].fillna(df['ConfirmedCases'])
Upvotes: 1