Reputation: 137
I have data in the long format and would like to convert as wide format with sparse data:
df = pd.DataFrame({'id':[1000,1001,1000,1002,1003,1000],
'Date':[26018,26017,26016,26010,26009,26009]})
And it is to be converted as wide format by assigning ones:
Upvotes: 3
Views: 1060
Reputation: 863166
Idea is add helper column and pivoting by DataFrame.pivot
, add missing values by DataFrame.reindex
and last convert misisng values to 0
with casting to integers:
df1 = (df.assign(new = 1)
.pivot('id','Date','new')
.reindex(range(df['Date'].min(), df['Date'].max() + 1), axis=1)
.fillna(0)
.astype(int))
print (df1)
Date 26009 26010 26011 26012 26013 26014 26015 26016 26017 26018
id
1000 1 0 0 0 0 0 0 1 0 1
1001 0 0 0 0 0 0 0 0 1 0
1002 0 1 0 0 0 0 0 0 0 0
1003 1 0 0 0 0 0 0 0 0 0
If get:
ValueError: Index contains duplicate entries, cannot reshape
it means there are duplicates like in last row of changed sample data. Then add DataFrame.drop_duplicates
:
df = pd.DataFrame({'id':[1000,1001,1000,1002,1003,1000, 1000],
'Date':[26018,26017,26016,26010,26009,26009, 26009]})
df1 = (df.assign(new = 1)
.drop_duplicates(subset=['id','Date'])
.pivot('id','Date','new')
.reindex(range(df['Date'].min(), df['Date'].max() + 1), axis=1)
.fillna(0)
.astype(int))
print (df1)
Date 26009 26010 26011 26012 26013 26014 26015 26016 26017 26018
id
1000 1 0 0 0 0 0 0 1 0 1
1001 0 0 0 0 0 0 0 0 1 0
1002 0 1 0 0 0 0 0 0 0 0
1003 1 0 0 0 0 0 0 0 0 0
Notice: My solution not counts rows, only return 1
if exist intersection id
and Date
else 0
.
Upvotes: 4
Reputation: 25259
You may take advantage of groupby.size
and unstack
. Finally, reindex
as other solutions
ix = range(df.Date.min(), df.Date.max()+1)
df_final = (df.groupby(['id', 'Date']).size()
.unstack(fill_value=0)
.reindex(ix, axis=1, fill_value=0))
Out[205]:
Date 26009 26010 26011 26012 26013 26014 26015 26016 26017 26018
id
1000 1 0 0 0 0 0 0 1 0 1
1001 0 0 0 0 0 0 0 0 1 0
1002 0 1 0 0 0 0 0 0 0 0
1003 1 0 0 0 0 0 0 0 0 0
Upvotes: 1
Reputation: 13343
You can use crosstab
in case you are counting the frequency also
df1 = pd.crosstab(df['id'], df['Date']).reindex(range(df['Date'].min(), df['Date'].max() + 1), axis=1).fillna(0).astype(int)
Output of df1:
Date 26009 26010 26011 26012 26013 26014 26015 26016 26017 26018
id
1000 1 0 0 0 0 0 0 1 0 1
1001 0 0 0 0 0 0 0 0 1 0
1002 0 1 0 0 0 0 0 0 0 0
1003 1 0 0 0 0 0 0 0 0 0
Upvotes: 2