Reputation: 1303
I have previously asked a similar question here How to get aggregate of data from multiple dates in pandas?
But my problem is slightly more complicated.
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'name':['a', 'b', 'c', 'd', 'e', 'f'],
'vaccine_1':['2021-01-20', '2021-01-20', '2021-02-20', np.nan, '2021-02-22', '2021-02-23'],
'vaccine_2':['2021-02-22', '2021-02-22', '2021-02-25', np.nan, '2021-03-22', np.nan],
'vaccine_type': ['AZ', 'AZ', 'AZ', np.nan, 'Sinovac', 'Sinovac'],
'gender':['F', 'M', 'F', 'F', 'M', 'M']})
df['vaccine_1'] = pd.to_datetime(df['vaccine_1']).dt.date
df['vaccine_2'] = pd.to_datetime(df['vaccine_2']).dt.date
df
I want to get a table that comprises something like this.
date | F | M | vaccine_type | vaccine_1_total | vaccine_2_total |
My original tables are way more complicated than this with more data, but I think sums up what I mean to do.
Thanks to the answers from my previous question, I can get the dates right using pandas melt.
out = df.melt(var_name='vaccine', value_name='date', value_vars=['vaccine_1', 'vaccine_2'])
print(pd.crosstab(out['date'], out['vaccine']))
Output:
vaccine vaccine_1 vaccine_2
date
2021-01-20 2 0
2021-02-20 1 0
2021-02-22 1 2
2021-02-23 1 0
2021-02-25 0 1
2021-03-22 0 1
But I don't know how to modify that to make it fit my needs. Any idea? Thanks.
EDIT:
Desired dataframe
date | F | M | vaccine_type | vaccine_1_total | vaccine_2_total
'2021-01-20'| 1 | 1 | AZ | 2 | 0
'2021-02-20'| 1 | 0 | AZ | 1 | 0
'2021-02-22'| 1 | 1 | AZ | 1 | 1
'2021-02-22'| 1 | 0 | Sinovac | 1 | 0
and so on
So I think it should be a combination of groupby and melt? I can use groupby to get the non date columns, but how do I combine that with melt?
Upvotes: 4
Views: 405
Reputation: 260780
You can first create a table with the combined counts for gender and vaccine_x, then merge on the sum on the different indicators:
df2 = (df.melt(id_vars=['gender', 'vaccine_type'],
value_vars=['vaccine_1', 'vaccine_2'],
var_name='vaccine', value_name='date')
.groupby(['date', 'vaccine_type', 'gender']).agg({'vaccine': 'value_counts'})
.rename(columns={'vaccine': 'count'})
.reset_index()
.pivot_table(index=['date', 'vaccine_type'], columns=['gender', 'vaccine'], values='count', fill_value=0)
)
pd.merge(df2.sum(level=0, axis=1).reset_index(),
df2.sum(level=1, axis=1).reset_index(),
on=['date', 'vaccine_type']
)
output:
date vaccine_type F M vaccine_1 vaccine_2
0 2021-01-20 AZ 1 1 2 0
1 2021-02-20 AZ 1 0 1 0
2 2021-02-22 AZ 1 1 0 2
3 2021-02-22 Sinovac 0 1 1 0
4 2021-02-23 Sinovac 0 1 1 0
5 2021-02-25 AZ 1 0 0 1
6 2021-03-22 Sinovac 0 1 0 1
intermediate output (df2):
gender F M
vaccine vaccine_1 vaccine_2 vaccine_1 vaccine_2
date vaccine_type
2021-01-20 AZ 1 0 1 0
2021-02-20 AZ 1 0 0 0
2021-02-22 AZ 0 1 0 1
Sinovac 0 0 1 0
2021-02-23 Sinovac 0 0 1 0
2021-02-25 AZ 0 1 0 0
2021-03-22 Sinovac 0 0 0 1
Upvotes: 2