Reputation: 49
I have a legacy datafile that contains data in the following format:
SURVEY NUM TEMPORAL
WHS 1 Byz
WHS 1 Byz_Um
WHS 1 IAII
WHS 1 L_Isl
WHS 1 L_Rom
WHS 1 Mod
WHS 1 Nab
WHS 2 Byz
WHS 2 Mod
WHS 2 Unk
WHS 2 MP
WHS 3 Byz
WHS 3 Nab
WHS 3 LMP
WHS 3 UP
WHS 4 LMP
WHS 4 MP
WHS 4 UP
WHS 5 Byz
WHS 5 Unk
WHS 5 LMP
etc..
Essentially, the column "NUM" is a unique identifier that relates to a specific site, and the column "TEMPORAL" is a value associated with that site. For whatever reason, the original file repeates this over several lines for sites with mutltiple temporal occupations (this archaeological data). I would like to use Pandas to conver this to something like so:
SURVEY NUM Byz Byz_Um IAII L_Isl LMP L_Rom Nab MP Mod Unk UP
WHS 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0
WHS 2 1 0 0 0 0 0 0 0 0 1 1 0 0 1
WHS 3 1 0 0 0 0 0 0 0 1 1 1 0 1 0
WHS 4 0 0 0 0 0 0 0 0 0 0 0 1 0 1
WHS 5 1 0 0 0 0 0 0 1 0 0 0 0 1 0
Where a 1 is placed into a new column if that TEMPORAL period exists. I tried using df.pivot
with "NUM" as the index and "TEMPORAL" as the columns, but that did not work. There are several thousand sites in this database, so doing it manually is not a possibility. Any ideas?
Upvotes: 2
Views: 175
Reputation: 49
Thanks to piRSquared for leading me to the correct answer. The following code did exactly what I wanted:
df2 = pandas.crosstab(df['NUM'], [df['SURVEY'], df['TEMPORAL']])
Upvotes: 0
Reputation: 294298
np.bincount
and pd.factorize
i, r = pd.factorize(df['SURVEY'])
j, c = pd.factorize(df['NUM TEMPORAL'])
n, m = len(r), len(c)
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
pd.DataFrame(b, r, c)
Byz Byz_Um IAII L_Isl L_Rom Mod Nab Unk MP LMP UP
WHS 1 1 1 1 1 1 1 1 0 0 0 0
WHS 2 1 0 0 0 0 1 0 1 1 0 0
WHS 3 1 0 0 0 0 0 1 0 0 1 1
WHS 4 0 0 0 0 0 0 0 0 1 1 1
WHS 5 1 0 0 0 0 0 0 1 0 1 0
pd.get_dummies
and pd.DataFrame.dot
pd.get_dummies(df['SURVEY']).T.dot(pd.get_dummies(df['NUM TEMPORAL']))
Byz Byz_Um IAII LMP L_Isl L_Rom MP Mod Nab UP Unk
WHS 1 1 1 1 0 1 1 0 1 1 0 0
WHS 2 1 0 0 0 0 0 1 1 0 0 1
WHS 3 1 0 0 1 0 0 0 0 1 1 0
WHS 4 0 0 0 1 0 0 1 0 0 1 0
WHS 5 1 0 0 1 0 0 0 0 0 0 1
pd.crosstab
pd.crosstab(*df.values.T)
col_0 Byz Byz_Um IAII LMP L_Isl L_Rom MP Mod Nab UP Unk
row_0
WHS 1 1 1 1 0 1 1 0 1 1 0 0
WHS 2 1 0 0 0 0 0 1 1 0 0 1
WHS 3 1 0 0 1 0 0 0 0 1 1 0
WHS 4 0 0 0 1 0 0 1 0 0 1 0
WHS 5 1 0 0 1 0 0 0 0 0 0 1
Or
pd.crosstab(df['SURVEY'], df['NUM TEMPORAL'])
NUM TEMPORAL Byz Byz_Um IAII LMP L_Isl L_Rom MP Mod Nab UP Unk
SURVEY
WHS 1 1 1 1 0 1 1 0 1 1 0 0
WHS 2 1 0 0 0 0 0 1 1 0 0 1
WHS 3 1 0 0 1 0 0 0 0 1 1 0
WHS 4 0 0 0 1 0 0 1 0 0 1 0
WHS 5 1 0 0 1 0 0 0 0 0 0 1
reconstruction
pd.Series(1, df.values.T.tolist()).unstack(fill_value=0)
Byz Byz_Um IAII LMP L_Isl L_Rom MP Mod Nab UP Unk
WHS 1 1 1 1 0 1 1 0 1 1 0 0
WHS 2 1 0 0 0 0 0 1 1 0 0 1
WHS 3 1 0 0 1 0 0 0 0 1 1 0
WHS 4 0 0 0 1 0 0 1 0 0 1 0
WHS 5 1 0 0 1 0 0 0 0 0 0 1
Explanation
df.values.T.tolist()
passes a list of length 2, consisting of two levels. Those levels get interpreted as a MultiIndex.1
becomes the default value.unstack
with fill_value=0
reshapes accordinglyUpvotes: 2
Reputation: 323276
By using get_dummies
df.set_index(['SURVEY','NUM']).TEMPORAL.str.get_dummies().sum(level=1)
Out[127]:
Byz Byz_Um IAII LMP L_Isl L_Rom MP Mod Nab UP Unk
NUM
1 1 1 1 0 1 1 0 1 1 0 0
2 1 0 0 0 0 0 1 1 0 0 1
3 1 0 0 1 0 0 0 0 1 1 0
4 0 0 0 1 0 0 1 0 0 1 0
5 1 0 0 1 0 0 0 0 0 0 1
Upvotes: 2