Reputation: 45
I am trying to use a dataframe like this (sorry for the formatting, I'm typing this on a phone):
'Date' 'Color' 'Jar'
0 '05-10-2017' 'Red' 1
1 '05-10-2017' 'Green' 2
2 '05-10-2017' 'Blue' 1
3 '05-10-2017' 'Red' 2
4 '05-10-2017' 'Blue' 1
5 '05-11-2017' 'Red' 2
6 '05-11-2017' 'Green' 1
7 '05-11-2017' 'Red' 2
8 '05-11-2017' 'Green' 1
9 '05-11-2017' 'Blue' 1
10 '05-11-2017' 'Blue' 2
11 '05-11-2017' 'Red' 2
12 '05-11-2017' 'Blue' 2
13 '05-11-2017' 'Blue' 1
14 '05-12-2017' 'Green' 2
15 '05-12-2017' 'Blue' 1
16 '05-12-2017' 'Red' 1
17 '05-12-2017' 'Blue' 2
18 '05-12-2017' 'Blue' 2
and derive one that looks like the one below, with the columns filled in with the count of instances per date.
Date Jar 1 Red Jar 2 Red Jar 1 Green Jar 2 Green Jar 1 Blue Jar 2 Blue
05-10-2017
05-11-2017
05-12-2017
I was trying to use groupby to accomplish this and was able to get the counts of each color for each day, but I'm unsure how to split the color columns by which Jar they came from. I also read that query or loc might be options for accomplishing this. Any direction would be greatly appreciated.
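For anyone who wants to run the answers, here is a minimal sketch that rebuilds the sample frame from the table above. The variable name `df` and the `groupby` call are assumptions: the question does not show the original attempt, so this is only a guess at the "counts of each color for each day" step that already works.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Date": ["05-10-2017"] * 5 + ["05-11-2017"] * 9 + ["05-12-2017"] * 5,
    "Color": ["Red", "Green", "Blue", "Red", "Blue",
              "Red", "Green", "Red", "Green", "Blue",
              "Blue", "Red", "Blue", "Blue",
              "Green", "Blue", "Red", "Blue", "Blue"],
    "Jar": [1, 2, 1, 2, 1,
            2, 1, 2, 1, 1,
            2, 2, 2, 1,
            2, 1, 1, 2, 2],
})

# Assumed version of the step that already works:
# one row per date, one column per color, ignoring the jar
color_counts = df.groupby(["Date", "Color"]).size().unstack(fill_value=0)
print(color_counts)
```

What is still missing at this point is the jar: each color column needs to be split into a "Jar 1" and a "Jar 2" column.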
Upvotes: 1
Views: 164
Reputation: 323226
Or you can try this
df=df.set_index(['Date','Color']).stack().reset_index()
df['Columns']=df['level_2']+' '+df[0].astype(str)+' '+df['Color']
df.groupby(['Date','Columns']).size().unstack().fillna(0)
Out[239]:
Columns Jar 1 Blue Jar 1 Green Jar 1 Red Jar 2 Blue Jar 2 Green \
Date
05-10-2017 2 0 1 0 1
05-11-2017 2 2 0 2 0
05-12-2017 1 0 1 2 1
Jar 2 Red
Date
05-10-2017 1
05-11-2017 3
05-12-2017 0
EDIT: Same approach, simpler, faster version
df['columns'] = 'Jar ' + df.Jar.astype(str) + ' ' + df.Color
df.groupby(['Date', 'columns']).Jar.count().unstack(fill_value=0)
This version should beat the get_dummies approach (or perform the same).
Upvotes: 1
Reputation: 402493
Option 1
pd.crosstab
df1
Date Color Jar
0 05-10-2017 Red 1
1 05-10-2017 Green 2
2 05-10-2017 Blue 1
3 05-10-2017 Red 2
4 05-10-2017 Blue 1
5 05-11-2017 Red 2
6 05-11-2017 Green 1
7 05-11-2017 Red 2
8 05-11-2017 Green 1
9 05-11-2017 Blue 1
10 05-11-2017 Blue 2
11 05-11-2017 Red 2
12 05-11-2017 Blue 2
13 05-11-2017 Blue 1
14 05-12-2017 Green 2
15 05-12-2017 Blue 1
16 05-12-2017 Red 1
17 05-12-2017 Blue 2
18 05-12-2017 Blue 2
df2 = pd.crosstab(df1.Date, [df1.Jar, df1.Color])
df2.columns = df2.columns.map('{0[0]} {0[1]}'.format) # borrowed this line from https://stackoverflow.com/a/46102413/4909087
df2 = df2.add_prefix('Jar ')
df2
Jar 1 Blue Jar 1 Green Jar 1 Red Jar 2 Blue Jar 2 Green \
Date
05-10-2017 2 0 1 0 1
05-11-2017 2 2 0 2 0
05-12-2017 1 0 1 2 1
Jar 2 Red
Date
05-10-2017 1
05-11-2017 3
05-12-2017 0
Option 2
pd.get_dummies and df.groupby
df1 = df1.set_index('Date')
df1 = pd.get_dummies(df1.Jar.astype(str).str.cat(df1.Color, sep=' '))\
.add_prefix('Jar ').groupby(level=0).sum()
df1
Jar 1 Blue Jar 1 Green Jar 1 Red Jar 2 Blue Jar 2 Green \
Date
05-10-2017 2 0 1 0 1
05-11-2017 2 2 0 2 0
05-12-2017 1 0 1 2 1
Jar 2 Red
Date
05-10-2017 1
05-11-2017 3
05-12-2017 0
Performance
100 loops, best of 3: 13.4 ms per loop # pivot_table
100 loops, best of 3: 9.05 ms per loop # stacking, grouping, unstacking
100 loops, best of 3: 10.4 ms per loop # crosstab
100 loops, best of 3: 3.57 ms per loop # get_dummies
(df * 10000)
10 loops, best of 3: 42.8 ms per loop # pivot_table
1 loop, best of 3: 913 ms per loop # stacking, grouping, unstacking
10 loops, best of 3: 43.1 ms per loop # crosstab
1 loop, best of 3: 885 ms per loop # get_dummies
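What you want to use depends on the size of your data: get_dummies is fastest on the small frame, while pivot_table and crosstab are roughly 20 times faster than the other two on the large one.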
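To rerun the comparison on your own machine, something like the sketch below may help. It first checks that the four approaches agree on the sample data and then times them on a repeated copy. The function names are hypothetical, and building the large frame with `pd.concat([df] * n)` is an assumption about what "df * 10000" means; a factor of 1000 is used here only to keep the run short.

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    "Date": ["05-10-2017"] * 5 + ["05-11-2017"] * 9 + ["05-12-2017"] * 5,
    "Color": "Red Green Blue Red Blue Red Green Red Green Blue "
             "Blue Red Blue Blue Green Blue Red Blue Blue".split(),
    "Jar": [1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2],
})

def flatten(out):
    # Turn (jar, color) column tuples into "Jar <jar> <color>" labels
    out.columns = [f"Jar {jar} {color}" for jar, color in out.columns]
    return out

def via_pivot_table(df):
    return flatten(df.assign(count=1).pivot_table(
        index="Date", columns=["Jar", "Color"],
        values="count", aggfunc="sum", fill_value=0))

def via_groupby(df):
    key = "Jar " + df["Jar"].astype(str) + " " + df["Color"]
    return (df.assign(columns=key)
              .groupby(["Date", "columns"]).size().unstack(fill_value=0))

def via_crosstab(df):
    return flatten(pd.crosstab(df["Date"], [df["Jar"], df["Color"]]))

def via_get_dummies(df):
    key = "Jar " + df["Jar"].astype(str) + " " + df["Color"]
    # One indicator column per label, then sum the indicators per date
    return pd.get_dummies(key, dtype=int).groupby(df["Date"]).sum()

approaches = {
    "pivot_table": via_pivot_table,
    "groupby": via_groupby,
    "crosstab": via_crosstab,
    "get_dummies": via_get_dummies,
}

# All four should produce the same counts on the sample data
results = {name: fn(df).to_numpy().tolist() for name, fn in approaches.items()}
all_equal = all(r == results["crosstab"] for r in results.values())
print("all approaches agree:", all_equal)

# Time each approach on a larger frame (repeat factor is an assumption)
big = pd.concat([df] * 1000, ignore_index=True)
for name, fn in approaches.items():
    seconds = timeit.timeit(lambda: fn(big), number=3) / 3
    print(f"{name:>12}: {seconds * 1000:.1f} ms per call")
```

Absolute numbers will differ by pandas version and hardware, so treat the output as a relative comparison only.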
Upvotes: 2
Reputation: 153460
Let's try this:
df_out = df.assign(count=1).pivot_table(index='Date',columns=['Jar','Color'], values='count',aggfunc='sum', fill_value=0)
df_out.columns = df_out.columns.map('{0[0]} {0[1]}'.format)
df_out.add_prefix('Jar ')
Output:
Jar 1 Blue Jar 1 Green Jar 1 Red Jar 2 Blue Jar 2 Green \
Date
05-10-2017 2 0 1 0 1
05-11-2017 2 2 0 2 0
05-12-2017 1 0 1 2 1
Jar 2 Red
Date
05-10-2017 1
05-11-2017 3
05-12-2017 0
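One thing the outputs in this thread do not do is match the column order asked for in the question (Jar 1 Red, Jar 2 Red, Jar 1 Green, ...), because the columns come back sorted. A minimal sketch of reordering them with `reindex`, assuming the sample data from the question; the names `counts` and `wanted` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["05-10-2017"] * 5 + ["05-11-2017"] * 9 + ["05-12-2017"] * 5,
    "Color": "Red Green Blue Red Blue Red Green Red Green Blue "
             "Blue Red Blue Blue Green Blue Red Blue Blue".split(),
    "Jar": [1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2],
})

# Count rows per date for every (jar, color) pair, then flatten the labels
counts = pd.crosstab(df["Date"], [df["Jar"], df["Color"]])
counts.columns = [f"Jar {jar} {color}" for jar, color in counts.columns]

# Column order requested in the question: grouped by color, jar 1 before jar 2
wanted = [f"Jar {jar} {color}"
          for color in ("Red", "Green", "Blue")
          for jar in (1, 2)]

# reindex reorders the columns; fill_value=0 also covers any jar/color
# pair that never occurs in the data
counts = counts.reindex(columns=wanted, fill_value=0)
print(counts)
```

The same `reindex` call should work on the result of any of the approaches here, since they all end up with the same flat "Jar N Color" labels.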
Upvotes: 1