Reputation: 497
I have a pandas dataframe with two whose combination I consider the unique identifier and then one single column 'C' that can have different values. So the same unique identifier can repeat as many times as the number of different 'C' values it has.
import pandas as pd
data = {'A' : ['000001','000001','000001','000001','000002','000002','000003'],
'B' : ['1A','1A','1C','1D','1A','1A','1D'],
'C' : ['003','012','016','003','020','012','053']}
df = pd.DataFrame(data)
+-------------------+
| A | B | C |
+-------------------+
| 000001 | 1A | 003 |
| 000001 | 1A | 012 |
| 000001 | 1C | 016 |
| 000001 | 1D | 003 |
| 000002 | 1A | 020 |
| 000002 | 1A | 012 |
| 000003 | 1D | 053 |
+-------------------+
I want to group rows based on A, B and then pivot out the C column and turn them into dummies. This is what I need the final output to look like:
+----------------------------------------------------+
| A | B | C_003 | C_012 | C_016 | C_020 |C_053 |
+----------------------------------------------------+
| 000001 | 1A | 1 | 1 | 0 | 0 | 0 |
| 000001 | 1C | 0 | 0 | 1 | 0 | 0 |
| 000001 | 1D | 1 | 0 | 0 | 0 | 0 |
| 000002 | 1A | 0 | 1 | 0 | 1 | 0 |
| 000003 | 1D | 0 | 0 | 0 | 0 | 1 |
+----------------------------------------------------+
I'm really bad at pivot tables, but also just not sure if a pivot table can get my to my desired output. I'm looking for a super efficient process, as the actual data has several more columns as part of the identifier (like A, B), thousands of possible C values, and millions of rows. Any ideas?
Upvotes: 2
Views: 1146
Reputation: 863166
Use groupby
with size
, reshape by unstack
and filter all values higher as 1
by clip_upper
:
df = (df.groupby(['A','B', 'C'])
.size()
.unstack(fill_value=0)
.add_prefix('C_')
.clip_upper(1)
.reset_index())
print (df)
C A B C_003 C_012 C_016 C_020 C_053
0 000001 1A 1 1 0 0 0
1 000001 1C 0 0 1 0 0
2 000001 1D 1 0 0 0 0
3 000002 1A 0 1 0 1 0
4 000003 1D 0 0 0 0 1
Upvotes: 4
Reputation: 294488
pandas.get_dummies
to the rescue.
pd.get_dummies(df.set_index(['A', 'B'])).reset_index()
A B C_003 C_012 C_016 C_020 C_053
0 000001 1A 1 0 0 0 0
1 000001 1A 0 1 0 0 0
2 000001 1C 0 0 1 0 0
3 000001 1D 1 0 0 0 0
4 000002 1A 0 0 0 1 0
5 000002 1A 0 1 0 0 0
6 000003 1D 0 0 0 0 1
Or... (thanks Wen for the reminder)
pd.get_dummies(df.set_index(['A', 'B'])).sum(level=[0, 1]).reset_index()
A B C_003 C_012 C_016 C_020 C_053
0 000001 1A 1 1 0 0 0
1 000001 1C 0 0 1 0 0
2 000001 1D 1 0 0 0 0
3 000002 1A 0 1 0 1 0
4 000003 1D 0 0 0 0 1
Upvotes: 4
Reputation: 323316
You know what I remember my favorite function crosstab
pd.crosstab([df.A,df.B],df.C).reset_index()
Out[70]:
C A B 003 012 016 020 053
0 000001 1A 1 1 0 0 0
1 000001 1C 0 0 1 0 0
2 000001 1D 1 0 0 0 0
3 000002 1A 0 1 0 1 0
4 000003 1D 0 0 0 0 1
By using str
get_dummies
df.set_index(['A','B']).C.str.get_dummies().add_prefix('C_').sum(level=[0,1]).reset_index()
Out[60]:
A B C_003 C_012 C_016 C_020 C_053
0 000001 1A 1 1 0 0 0
1 000001 1C 0 0 1 0 0
2 000001 1D 1 0 0 0 0
3 000002 1A 0 1 0 1 0
4 000003 1D 0 0 0 0 1
Upvotes: 3