jma
jma

Reputation: 497

Pandas pivot out unique column

I have a pandas dataframe with two whose combination I consider the unique identifier and then one single column 'C' that can have different values. So the same unique identifier can repeat as many times as the number of different 'C' values it has.

import pandas as pd
data = {'A' : ['000001','000001','000001','000001','000002','000002','000003'],
       'B' : ['1A','1A','1C','1D','1A','1A','1D'],
       'C' : ['003','012','016','003','020','012','053']}

df = pd.DataFrame(data)

+-------------------+
| A      | B  | C   | 
+-------------------+
| 000001 | 1A | 003 |
| 000001 | 1A | 012 |
| 000001 | 1C | 016 |
| 000001 | 1D | 003 |
| 000002 | 1A | 020 |
| 000002 | 1A | 012 |
| 000003 | 1D | 053 |
+-------------------+

I want to group rows based on A, B and then pivot out the C column and turn them into dummies. This is what I need the final output to look like:

+----------------------------------------------------+
| A      | B  | C_003 | C_012 | C_016 | C_020 |C_053 |
+----------------------------------------------------+
| 000001 | 1A | 1     | 1     | 0     | 0     | 0    | 
| 000001 | 1C | 0     | 0     | 1     | 0     | 0    | 
| 000001 | 1D | 1     | 0     | 0     | 0     | 0    | 
| 000002 | 1A | 0     | 1     | 0     | 1     | 0    | 
| 000003 | 1D | 0     | 0     | 0     | 0     | 1    | 
+----------------------------------------------------+

I'm really bad at pivot tables, but also just not sure if a pivot table can get my to my desired output. I'm looking for a super efficient process, as the actual data has several more columns as part of the identifier (like A, B), thousands of possible C values, and millions of rows. Any ideas?

Upvotes: 2

Views: 1146

Answers (3)

jezrael
jezrael

Reputation: 863166

Use groupby with size, reshape by unstack and filter all values higher as 1 by clip_upper:

df = (df.groupby(['A','B', 'C'])
      .size()
      .unstack(fill_value=0)
      .add_prefix('C_')
      .clip_upper(1)
      .reset_index())
print (df)
C       A   B  C_003  C_012  C_016  C_020  C_053
0  000001  1A      1      1      0      0      0
1  000001  1C      0      0      1      0      0
2  000001  1D      1      0      0      0      0
3  000002  1A      0      1      0      1      0
4  000003  1D      0      0      0      0      1

Upvotes: 4

piRSquared
piRSquared

Reputation: 294488

pandas.get_dummies to the rescue.

pd.get_dummies(df.set_index(['A', 'B'])).reset_index()

        A   B  C_003  C_012  C_016  C_020  C_053
0  000001  1A      1      0      0      0      0
1  000001  1A      0      1      0      0      0
2  000001  1C      0      0      1      0      0
3  000001  1D      1      0      0      0      0
4  000002  1A      0      0      0      1      0
5  000002  1A      0      1      0      0      0
6  000003  1D      0      0      0      0      1

Or... (thanks Wen for the reminder)

pd.get_dummies(df.set_index(['A', 'B'])).sum(level=[0, 1]).reset_index()

        A   B  C_003  C_012  C_016  C_020  C_053
0  000001  1A      1      1      0      0      0
1  000001  1C      0      0      1      0      0
2  000001  1D      1      0      0      0      0
3  000002  1A      0      1      0      1      0
4  000003  1D      0      0      0      0      1

Upvotes: 4

BENY
BENY

Reputation: 323316

You know what I remember my favorite function crosstab

pd.crosstab([df.A,df.B],df.C).reset_index()
Out[70]: 
C       A   B  003  012  016  020  053
0  000001  1A    1    1    0    0    0
1  000001  1C    0    0    1    0    0
2  000001  1D    1    0    0    0    0
3  000002  1A    0    1    0    1    0
4  000003  1D    0    0    0    0    1

By using str get_dummies

df.set_index(['A','B']).C.str.get_dummies().add_prefix('C_').sum(level=[0,1]).reset_index()
Out[60]: 
        A   B  C_003  C_012  C_016  C_020  C_053
0  000001  1A      1      1      0      0      0
1  000001  1C      0      0      1      0      0
2  000001  1D      1      0      0      0      0
3  000002  1A      0      1      0      1      0
4  000003  1D      0      0      0      0      1

Upvotes: 3

Related Questions