Reputation: 867
I'm new to Pandas and I have a data frame of this form:
                 date category  value
0 2017-11-30 13:58:57        A    901
1 2017-11-30 13:59:41        B    905
2 2017-11-30 13:59:41        C    925
The first column is a date; the second column is categorical, with three known categories.
It was generated by:
import pandas as pd
# DataFrame.from_items was removed in pandas 1.0; a dict gives the same frame
df = pd.DataFrame({
    'date': ['2017-11-30 13:58:57', '2017-11-30 13:59:41', '2017-11-30 13:59:41'],
    'category': ['A', 'B', 'C'],
    'value': [901, 905, 925],
})
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
The problem is that not every category is present for each date. I want to add the missing categories, with missing values, to get:
                 date category  value
0 2017-11-30 13:58:57        A    901
1 2017-11-30 13:58:57        B    NaN
2 2017-11-30 13:58:57        C    NaN
3 2017-11-30 13:59:41        A    NaN
4 2017-11-30 13:59:41        B    905
5 2017-11-30 13:59:41        C    925
Is there a built-in way to do so without iterating the rows?
Upvotes: 1
Views: 609
Reputation: 862611
You can use reindex with a MultiIndex built by MultiIndex.from_product:
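# Move both key columns into the index
df = df.set_index(['date','category'])
# Build every (date, category) combination from the index levels
cats = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
# Reindexing adds the missing combinations, filled with NaN
df = df.reindex(cats).reset_index()
print(df)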
                 date category  value
0 2017-11-30 13:58:57        A  901.0
1 2017-11-30 13:58:57        B    NaN
2 2017-11-30 13:58:57        C    NaN
3 2017-11-30 13:59:41        A    NaN
4 2017-11-30 13:59:41        B  905.0
5 2017-11-30 13:59:41        C  925.0
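For convenience, the reindex approach can be put together as one self-contained script. This is only a sketch: the helper name `fill_missing_categories` and its `keys` parameter are made up for illustration, and the sample frame is rebuilt with a plain dict constructor rather than the code from the question.

```python
import pandas as pd

# Sample data from the question, rebuilt with a dict constructor
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-11-30 13:58:57",
        "2017-11-30 13:59:41",
        "2017-11-30 13:59:41",
    ]),
    "category": pd.Categorical(["A", "B", "C"]),
    "value": [901, 905, 925],
})


def fill_missing_categories(frame, keys=("date", "category")):
    """Hypothetical helper: return a frame with one row per key combination."""
    indexed = frame.set_index(list(keys))
    # Cartesian product of the index levels gives every (date, category) pair
    full_index = pd.MultiIndex.from_product(
        indexed.index.levels, names=indexed.index.names
    )
    # Pairs absent from the original data are added with NaN values
    return indexed.reindex(full_index).reset_index()


result = fill_missing_categories(df)
print(result)
```

Note that the `value` column becomes float, because NaN cannot be stored in an integer column.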
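Alternatively, use unstack followed by stack:
df = (df.set_index(['date','category'])['value']
        .unstack()
        # dropna=False keeps the NaN entries that unstack() creates for missing combinations
        .stack(dropna=False)
        .reset_index(name='value'))
print(df)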
                 date category  value
0 2017-11-30 13:58:57        A  901.0
1 2017-11-30 13:58:57        B    NaN
2 2017-11-30 13:58:57        C    NaN
3 2017-11-30 13:59:41        A    NaN
4 2017-11-30 13:59:41        B  905.0
5 2017-11-30 13:59:41        C  925.0
Upvotes: 1