Reputation: 49

Convert variable to multiple columns with Pandas

I have a legacy datafile that contains data in the following format:

SURVEY  NUM TEMPORAL
WHS 1   Byz
WHS 1   Byz_Um
WHS 1   IAII
WHS 1   L_Isl
WHS 1   L_Rom
WHS 1   Mod
WHS 1   Nab
WHS 2   Byz
WHS 2   Mod
WHS 2   Unk
WHS 2   MP
WHS 3   Byz
WHS 3   Nab
WHS 3   LMP
WHS 3   UP
WHS 4   LMP
WHS 4   MP
WHS 4   UP
WHS 5   Byz
WHS 5   Unk
WHS 5   LMP

etc..

Essentially, the column "NUM" is a unique identifier that relates to a specific site, and the column "TEMPORAL" is a value associated with that site. For whatever reason, the original file repeates this over several lines for sites with mutltiple temporal occupations (this archaeological data). I would like to use Pandas to conver this to something like so:

SURVEY NUM  Byz Byz_Um IAII L_Isl LMP L_Rom Nab MP Mod Unk UP
WHS 1   1  1  1  1  1  0  0  0  0  0  1  0  0  0
WHS 2   1  0  0  0  0  0  0  0  0  1  1  0  0  1
WHS 3   1  0  0  0  0  0  0  0  1  1  1  0  1  0
WHS 4   0  0  0  0  0  0  0  0  0  0  0  1  0  1
WHS 5   1  0  0  0  0  0  0  1  0  0  0  0  1  0

Where a 1 is placed into a new column if that TEMPORAL period exists. I tried using df.pivot with "NUM" as the index and "TEMPORAL" as the columns, but that did not work. There are several thousand sites in this database, so doing it manually is not a possibility. Any ideas?

Upvotes: 2

Answers (3)

Computarch

Reputation: 49

Thanks to piRSquared for leading me to the correct answer. The following code did exactly what I wanted:

df2 = pandas.crosstab(df['NUM'], [df['SURVEY'], df['TEMPORAL']])

Upvotes: 0

piRSquared

Reputation: 294298

np.bincount and pd.factorize

i, r = pd.factorize(df['SURVEY'])
j, c = pd.factorize(df['NUM TEMPORAL'])
n, m = len(r), len(c)

b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)

pd.DataFrame(b, r, c)

       Byz  Byz_Um  IAII  L_Isl  L_Rom  Mod  Nab  Unk  MP  LMP  UP
WHS 1    1       1     1      1      1    1    1    0   0    0   0
WHS 2    1       0     0      0      0    1    0    1   1    0   0
WHS 3    1       0     0      0      0    0    1    0   0    1   1
WHS 4    0       0     0      0      0    0    0    0   1    1   1
WHS 5    1       0     0      0      0    0    0    1   0    1   0

pd.get_dummies and pd.DataFrame.dot

pd.get_dummies(df['SURVEY']).T.dot(pd.get_dummies(df['NUM TEMPORAL']))

       Byz  Byz_Um  IAII  LMP  L_Isl  L_Rom  MP  Mod  Nab  UP  Unk
WHS 1    1       1     1    0      1      1   0    1    1   0    0
WHS 2    1       0     0    0      0      0   1    1    0   0    1
WHS 3    1       0     0    1      0      0   0    0    1   1    0
WHS 4    0       0     0    1      0      0   1    0    0   1    0
WHS 5    1       0     0    1      0      0   0    0    0   0    1

pd.crosstab

pd.crosstab(*df.values.T)

col_0  Byz  Byz_Um  IAII  LMP  L_Isl  L_Rom  MP  Mod  Nab  UP  Unk
row_0                                                             
WHS 1    1       1     1    0      1      1   0    1    1   0    0
WHS 2    1       0     0    0      0      0   1    1    0   0    1
WHS 3    1       0     0    1      0      0   0    0    1   1    0
WHS 4    0       0     0    1      0      0   1    0    0   1    0
WHS 5    1       0     0    1      0      0   0    0    0   0    1

pd.crosstab(df['SURVEY'], df['NUM TEMPORAL'])

NUM TEMPORAL  Byz  Byz_Um  IAII  LMP  L_Isl  L_Rom  MP  Mod  Nab  UP  Unk
SURVEY                                                                   
WHS 1           1       1     1    0      1      1   0    1    1   0    0
WHS 2           1       0     0    0      0      0   1    1    0   0    1
WHS 3           1       0     0    1      0      0   0    0    1   1    0
WHS 4           0       0     0    1      0      0   1    0    0   1    0
WHS 5           1       0     0    1      0      0   0    0    0   0    1

reconstruction

pd.Series(1, df.values.T.tolist()).unstack(fill_value=0)

       Byz  Byz_Um  IAII  LMP  L_Isl  L_Rom  MP  Mod  Nab  UP  Unk
WHS 1    1       1     1    0      1      1   0    1    1   0    0
WHS 2    1       0     0    0      0      0   1    1    0   0    1
WHS 3    1       0     0    1      0      0   0    0    1   1    0
WHS 4    0       0     0    1      0      0   1    0    0   1    0
WHS 5    1       0     0    1      0      0   0    0    0   0    1

Explanation

df.values.T.tolist() passes a list of length 2, consisting of two levels. Those levels get interpreted as a MultiIndex.
1 becomes the default value.
unstack with fill_value=0 reshapes accordingly

Upvotes: 2

BENY

Reputation: 323276

By using get_dummies

df.set_index(['SURVEY','NUM']).TEMPORAL.str.get_dummies().sum(level=1)
Out[127]: 
     Byz  Byz_Um  IAII  LMP  L_Isl  L_Rom  MP  Mod  Nab  UP  Unk
NUM                                                             
1      1       1     1    0      1      1   0    1    1   0    0
2      1       0     0    0      0      0   1    1    0   0    1
3      1       0     0    1      0      0   0    0    1   1    0
4      0       0     0    1      0      0   1    0    0   1    0
5      1       0     0    1      0      0   0    0    0   0    1

Upvotes: 2

Convert variable to multiple columns with Pandas

Answers (3)

Related Questions