Jayahe
Jayahe

Reputation: 137

How to convert long data format to wide data format in pandas

I have data in the long format and would like to convert as wide format with sparse data:

df = pd.DataFrame({'id':[1000,1001,1000,1002,1003,1000],
                   'Date':[26018,26017,26016,26010,26009,26009]})

input data

And it is to be converted as wide format by assigning ones:

output data

Upvotes: 3

Views: 1060

Answers (3)

jezrael
jezrael

Reputation: 863166

Idea is add helper column and pivoting by DataFrame.pivot, add missing values by DataFrame.reindex and last convert misisng values to 0 with casting to integers:

df1 = (df.assign(new = 1)
         .pivot('id','Date','new')
         .reindex(range(df['Date'].min(), df['Date'].max() + 1), axis=1)
         .fillna(0)
         .astype(int))
print (df1)
Date  26009  26010  26011  26012  26013  26014  26015  26016  26017  26018
id                                                                        
1000      1      0      0      0      0      0      0      1      0      1
1001      0      0      0      0      0      0      0      0      1      0
1002      0      1      0      0      0      0      0      0      0      0
1003      1      0      0      0      0      0      0      0      0      0

If get:

ValueError: Index contains duplicate entries, cannot reshape

it means there are duplicates like in last row of changed sample data. Then add DataFrame.drop_duplicates:

df = pd.DataFrame({'id':[1000,1001,1000,1002,1003,1000, 1000],
                   'Date':[26018,26017,26016,26010,26009,26009, 26009]})


df1 = (df.assign(new = 1)
         .drop_duplicates(subset=['id','Date'])
         .pivot('id','Date','new')
         .reindex(range(df['Date'].min(), df['Date'].max() + 1), axis=1)
         .fillna(0)
         .astype(int))
print (df1)
Date  26009  26010  26011  26012  26013  26014  26015  26016  26017  26018
id                                                                        
1000      1      0      0      0      0      0      0      1      0      1
1001      0      0      0      0      0      0      0      0      1      0
1002      0      1      0      0      0      0      0      0      0      0
1003      1      0      0      0      0      0      0      0      0      0

Notice: My solution not counts rows, only return 1 if exist intersection id and Date else 0.

Upvotes: 4

Andy L.
Andy L.

Reputation: 25259

You may take advantage of groupby.size and unstack. Finally, reindex as other solutions

ix = range(df.Date.min(), df.Date.max()+1)
df_final = (df.groupby(['id', 'Date']).size()
              .unstack(fill_value=0)
              .reindex(ix, axis=1, fill_value=0))

Out[205]:
Date  26009  26010  26011  26012  26013  26014  26015  26016  26017  26018
id
1000      1      0      0      0      0      0      0      1      0      1
1001      0      0      0      0      0      0      0      0      1      0
1002      0      1      0      0      0      0      0      0      0      0
1003      1      0      0      0      0      0      0      0      0      0

Upvotes: 1

Pygirl
Pygirl

Reputation: 13343

You can use crosstab in case you are counting the frequency also

df1 = pd.crosstab(df['id'], df['Date']).reindex(range(df['Date'].min(), df['Date'].max() + 1), axis=1).fillna(0).astype(int) 

Output of df1:

Date  26009  26010  26011  26012  26013  26014  26015  26016  26017  26018
id                                                                        
1000      1      0      0      0      0      0      0      1      0      1
1001      0      0      0      0      0      0      0      0      1      0
1002      0      1      0      0      0      0      0      0      0      0
1003      1      0      0      0      0      0      0      0      0      0

Upvotes: 2

Related Questions