Jon J

Reputation: 45

Pivot and count conditions in a pandas dataframe

I am trying to use a dataframe like this (sorry for the formatting, I'm typing this on a phone):

'Date'                  'Color'    'Jar'
0   '05-10-2017'    'Red'       1
1   '05-10-2017'    'Green'      2
2   '05-10-2017'    'Blue'       1
3   '05-10-2017'    'Red'      2
4   '05-10-2017'    'Blue'       1
5   '05-11-2017'    'Red'      2
6   '05-11-2017'    'Green'       1
7   '05-11-2017'    'Red'      2
8   '05-11-2017'    'Green'       1
9   '05-11-2017'    'Blue'       1
10  '05-11-2017'    'Blue'      2
11  '05-11-2017'    'Red'      2
12  '05-11-2017'    'Blue'      2
13  '05-11-2017'    'Blue'       1
14  '05-12-2017'    'Green'      2
15  '05-12-2017'    'Blue'       1
16  '05-12-2017'    'Red'       1
17  '05-12-2017'    'Blue'      2
18  '05-12-2017'    'Blue'       2

and derive one that looks like the one below, with the columns filled in with the count of instances per date.

Date                      Jar 1 Red    Jar 2 Red   Jar 1 Green  Jar 2 Green Jar 1 Blue Jar 2 Blue
05-10-2017
05-11-2017
05-12-2017

I was trying to use groupby to accomplish this and was able to get the counts of each color for each day, but I'm unsure how to split the color columns by which Jar they came from. I also read that query or loc might be options for accomplishing this. Any direction would be greatly appreciated.

Upvotes: 1

Views: 164

Answers (3)

BENY

Reputation: 323226

Or you can try this

df=df.set_index(['Date','Color']).stack().reset_index()
df['Columns']=df['level_2']+' '+df[0].astype(str)+' '+df['Color']
df.groupby(['Date','Columns']).size().unstack().fillna(0)

Out[239]: 
Columns     Jar 1 Blue  Jar 1 Green  Jar 1 Red  Jar 2 Blue  Jar 2 Green  \
Date                                                                      
05-10-2017           2            0          1           0            1   
05-11-2017           2            2          0           2            0   
05-12-2017           1            0          1           2            1   

            Jar 2 Red  
Date                   
05-10-2017          1  
05-11-2017          3  
05-12-2017          0  

EDIT: Same approach, simpler, faster version

df['columns'] = 'Jar ' + df.Jar.astype(str) + ' ' + df.Color
df.groupby(['Date', 'columns']).Jar.count().unstack(fill_value=0) 

This version should beat the get_dummies approach (or at least perform the same).
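For anyone who wants to run the simpler version end to end, here is a minimal self-contained sketch. The frame is rebuilt by hand from the table in the question, and I used size() instead of Jar.count(), which should give the same counts here since there are no missing values; the variable names rows and counts are just my own choices.

```python
import pandas as pd

# Sample data copied from the question (Date, Color, Jar).
rows = [
    ("05-10-2017", "Red", 1), ("05-10-2017", "Green", 2), ("05-10-2017", "Blue", 1),
    ("05-10-2017", "Red", 2), ("05-10-2017", "Blue", 1), ("05-11-2017", "Red", 2),
    ("05-11-2017", "Green", 1), ("05-11-2017", "Red", 2), ("05-11-2017", "Green", 1),
    ("05-11-2017", "Blue", 1), ("05-11-2017", "Blue", 2), ("05-11-2017", "Red", 2),
    ("05-11-2017", "Blue", 2), ("05-11-2017", "Blue", 1), ("05-12-2017", "Green", 2),
    ("05-12-2017", "Blue", 1), ("05-12-2017", "Red", 1), ("05-12-2017", "Blue", 2),
    ("05-12-2017", "Blue", 2),
]
df = pd.DataFrame(rows, columns=["Date", "Color", "Jar"])

# Build one label per row, e.g. "Jar 1 Red".
df["columns"] = "Jar " + df["Jar"].astype(str) + " " + df["Color"]

# Count rows per (Date, label) and spread the labels into columns.
# fill_value=0 fills combinations that never occur and keeps integer dtype.
counts = df.groupby(["Date", "columns"]).size().unstack(fill_value=0)
print(counts)
```

Passing fill_value=0 to unstack avoids the float conversion that a later fillna(0) would otherwise leave behind.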

Upvotes: 1

cs95

Reputation: 402493

Option 1

pd.crosstab

df1

          Date  Color  Jar
0   05-10-2017    Red    1
1   05-10-2017  Green    2
2   05-10-2017   Blue    1
3   05-10-2017    Red    2
4   05-10-2017   Blue    1
5   05-11-2017    Red    2
6   05-11-2017  Green    1
7   05-11-2017    Red    2
8   05-11-2017  Green    1
9   05-11-2017   Blue    1
10  05-11-2017   Blue    2
11  05-11-2017    Red    2
12  05-11-2017   Blue    2
13  05-11-2017   Blue    1
14  05-12-2017  Green    2
15  05-12-2017   Blue    1
16  05-12-2017    Red    1
17  05-12-2017   Blue    2
18  05-12-2017   Blue    2

out = pd.crosstab(df1.Date, [df1.Jar, df1.Color])  # df1 is the input frame shown above
out.columns = out.columns.map('{0[0]} {0[1]}'.format) # borrowed this line from https://stackoverflow.com/a/46102413/4909087
out = out.add_prefix('Jar ')
out

            Jar 1 Blue  Jar 1 Green  Jar 1 Red  Jar 2 Blue  Jar 2 Green  \
Date                                                                      
05-10-2017           2            0          1           0            1   
05-11-2017           2            2          0           2            0   
05-12-2017           1            0          1           2            1   

            Jar 2 Red  
Date                   
05-10-2017          1  
05-11-2017          3  
05-12-2017          0  

Option 2

pd.get_dummies and df.groupby

df1 = df1.set_index('Date')
df1 = pd.get_dummies(df1.Jar.astype(str).str.cat(df1.Color, sep=' '))\
                               .add_prefix('Jar ').groupby(level=0).sum()
df1

            Jar 1 Blue  Jar 1 Green  Jar 1 Red  Jar 2 Blue  Jar 2 Green  \
Date                                                                      
05-10-2017           2            0          1           0            1   
05-11-2017           2            2          0           2            0   
05-12-2017           1            0          1           2            1   

            Jar 2 Red  
Date                   
05-10-2017          1  
05-11-2017          3  
05-12-2017          0  

Performance

Small

100 loops, best of 3: 13.4 ms per loop # pivot_table
100 loops, best of 3: 9.05 ms per loop # stacking, grouping, unstacking
100 loops, best of 3: 10.4 ms per loop # crosstab
100 loops, best of 3: 3.57 ms per loop # get_dummies

Large (df * 10000)

10 loops, best of 3: 42.8 ms per loop # pivot_table
1 loop, best of 3: 913 ms per loop    # stacking, grouping, unstacking
10 loops, best of 3: 43.1 ms per loop # crosstab
1 loop, best of 3: 885 ms per loop    # get_dummies

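What you want to use depends on the size of your data: get_dummies was fastest on the small frame, while pivot_table and crosstab were roughly 20 times faster than the other two on the large one.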
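If you want to repeat the comparison on your own machine, something like the sketch below should work. It is not the exact code behind the numbers above: the helper names (via_crosstab, via_groupby, via_get_dummies) are made up for this example, the large frame is only 1000 copies of the sample to keep the run short, and I time with timeit rather than IPython's %timeit. Expect different absolute numbers depending on your pandas version.

```python
import timeit
import pandas as pd

# Sample data copied from the question (Date, Color, Jar).
rows = [
    ("05-10-2017", "Red", 1), ("05-10-2017", "Green", 2), ("05-10-2017", "Blue", 1),
    ("05-10-2017", "Red", 2), ("05-10-2017", "Blue", 1), ("05-11-2017", "Red", 2),
    ("05-11-2017", "Green", 1), ("05-11-2017", "Red", 2), ("05-11-2017", "Green", 1),
    ("05-11-2017", "Blue", 1), ("05-11-2017", "Blue", 2), ("05-11-2017", "Red", 2),
    ("05-11-2017", "Blue", 2), ("05-11-2017", "Blue", 1), ("05-12-2017", "Green", 2),
    ("05-12-2017", "Blue", 1), ("05-12-2017", "Red", 1), ("05-12-2017", "Blue", 2),
    ("05-12-2017", "Blue", 2),
]
small = pd.DataFrame(rows, columns=["Date", "Color", "Jar"])
# Assumption: 1000 copies instead of the 10000 used above, to keep this quick.
large = pd.concat([small] * 1000, ignore_index=True)


def via_crosstab(df):
    out = pd.crosstab(df["Date"], [df["Jar"], df["Color"]])
    # Flatten the (Jar, Color) MultiIndex into labels like "Jar 1 Red".
    out.columns = [f"Jar {jar} {color}" for jar, color in out.columns]
    return out


def via_groupby(df):
    labels = ("Jar " + df["Jar"].astype(str) + " " + df["Color"]).rename("columns")
    return df.groupby(["Date", labels]).size().unstack(fill_value=0)


def via_get_dummies(df):
    labels = "Jar " + df["Jar"].astype(str) + " " + df["Color"]
    # One indicator column per label, then sum the indicators per date.
    return pd.get_dummies(labels, dtype=int).groupby(df["Date"]).sum()


for func in (via_crosstab, via_groupby, via_get_dummies):
    for size_name, frame in (("small", small), ("large", large)):
        # Best of 3 repeats, 5 calls each, reported per call.
        best = min(timeit.repeat(lambda: func(frame), number=5, repeat=3)) / 5
        print(f"{func.__name__:16s} {size_name:6s} {best * 1000:8.2f} ms")
```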

Upvotes: 2

Scott Boston

Reputation: 153460

Let's try this:

df_out = df.assign(count=1).pivot_table(index='Date',columns=['Jar','Color'], values='count',aggfunc='sum', fill_value=0)

df_out.columns = df_out.columns.map('{0[0]} {0[1]}'.format)

df_out.add_prefix('Jar ')

Output:

            Jar 1 Blue  Jar 1 Green  Jar 1 Red  Jar 2 Blue  Jar 2 Green  \
Date                                                                      
05-10-2017           2            0          1           0            1   
05-11-2017           2            2          0           2            0   
05-12-2017           1            0          1           2            1   

            Jar 2 Red  
Date                   
05-10-2017          1  
05-11-2017          3  
05-12-2017          0 
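One thing to note: the columns come out sorted alphabetically, while the question lists them as Red, Green, Blue with the two jars side by side. If that order matters, a reindex at the end should do it. Here is a small sketch under that assumption, with the frame rebuilt from the question and the names wanted and df_out chosen by me:

```python
import pandas as pd

# Sample data copied from the question (Date, Color, Jar).
rows = [
    ("05-10-2017", "Red", 1), ("05-10-2017", "Green", 2), ("05-10-2017", "Blue", 1),
    ("05-10-2017", "Red", 2), ("05-10-2017", "Blue", 1), ("05-11-2017", "Red", 2),
    ("05-11-2017", "Green", 1), ("05-11-2017", "Red", 2), ("05-11-2017", "Green", 1),
    ("05-11-2017", "Blue", 1), ("05-11-2017", "Blue", 2), ("05-11-2017", "Red", 2),
    ("05-11-2017", "Blue", 2), ("05-11-2017", "Blue", 1), ("05-12-2017", "Green", 2),
    ("05-12-2017", "Blue", 1), ("05-12-2017", "Red", 1), ("05-12-2017", "Blue", 2),
    ("05-12-2017", "Blue", 2),
]
df = pd.DataFrame(rows, columns=["Date", "Color", "Jar"])

# Same pivot as above: one helper column of ones, summed per (Jar, Color).
df_out = df.assign(count=1).pivot_table(
    index="Date", columns=["Jar", "Color"], values="count",
    aggfunc="sum", fill_value=0,
)

# Flatten the (Jar, Color) MultiIndex into labels like "Jar 1 Red".
df_out.columns = [f"Jar {jar} {color}" for jar, color in df_out.columns]

# Column order as written in the question: both jars for Red, then Green, then Blue.
wanted = [f"Jar {jar} {color}" for color in ("Red", "Green", "Blue") for jar in (1, 2)]

# reindex also adds any missing combination as a column of zeros.
df_out = df_out.reindex(columns=wanted, fill_value=0)
print(df_out)
```

Using reindex rather than plain column selection means a jar/color pair that never shows up in the data still gets a zero column instead of raising a KeyError.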

Upvotes: 1
