dylkot
dylkot

Reputation: 2475

Normalizing columns of multiindex dataframe in Pandas (maybe a bug?)

I have a Pandas Dataframe with hierarchically indexed columns. Eg:

cols = pd.MultiIndex.from_tuples([('syn', 'A'), ('mis', 'A'), ('non', 'A'),     ('syn', 'C'), ('mis', 'C'), ('non', 'C'),
                              ('syn', 'T'), ('mis', 'T'), ('non', 'T'), ('syn', 'G'), ('mis', 'G'), ('non', 'G')])
sample = pd.DataFrame(np.random.randint(1, 10, (4,12)), columns=cols, index=['A', 'C', 'G', 'T'])
sample.head()

Giving:

    syn mis non syn mis non syn mis non syn mis non
    A   A   A   C   C   C   T   T   T   G   G   G
A   7   3   9   5   4   8   6   4   3   6   4   2
C   5   2   2   4   9   6   3   3   9   6   2   1
G   2   4   5   2   8   3   8   3   6   1   2   4
T   9   4   8   9   8   5   8   8   2   2   6   5

However, when I attempt to normalize on level 1 of the columns as follows:

sample.groupby(axis=1, level=1).transform(lambda z: z.div(z.sum(axis=1), axis=0))

I get an error:

ValueError: Length mismatch: Expected axis has 4 elements, new values have 12 elements

Weirdly, it works fine if I use up to 3 of the values of level 1 of axis 1, for example:

ind = sample.columns.get_level_values(1).isin(['A', 'C', 'G'])
subsample = sample.loc[:, ind]
subsample.head()

which just takes the first 3 sets of values from the sample:

    syn mis non syn mis non syn mis non
    A   A   A   C   C   C   G   G   G
A   7   3   9   5   4   8   6   4   2
C   5   2   2   4   9   6   6   2   1
G   2   4   5   2   8   3   1   2   4
T   9   4   8   9   8   5   2   6   5

Then:

subsample.groupby(axis=1, level=1).transform(lambda z: z.div(z.sum(axis=1), axis=0))

correctly returns:

    syn mis non syn mis non syn mis non
    A       A       A       C       C       C       G       G       G
A   0.37    0.16    0.47    0.29    0.24    0.47    0.50    0.33    0.17
C   0.56    0.22    0.22    0.21    0.47    0.32    0.67    0.22    0.11
G   0.18    0.36    0.45    0.15    0.62    0.23    0.14    0.29    0.57
T   0.43    0.19    0.38    0.41    0.36    0.23    0.15    0.46    0.38

Any idea why the latter works and the former doesn't? I'm using Pandas version 0.16.1

Upvotes: 0

Views: 534

Answers (2)

dylkot
dylkot

Reputation: 2475

Apparently: sample.groupby(axis=1, level=0).apply(lambda z: z.div(z.sum(axis=1), axis=0))

works as intended returning:

syn mis non syn mis non syn mis non syn mis non
    A       A       A       C       C       C       T       T       T       G       G       G
A   0.24    0.38    0.38    0.36    0.14    0.50    0.19    0.31    0.50    0.50    0.17    0.33
C   0.22    0.44    0.33    0.31    0.12    0.56    0.25    0.58    0.17    0.08    0.17    0.75
G   0.28    0.36    0.36    0.80    0.10    0.10    0.38    0.08    0.54    0.27    0.64    0.09
T   0.24    0.41    0.35    0.60    0.33    0.07    0.43    0.43    0.14    0.55    0.27    0.18

Although I still don't understand why .transform didn't work as well. Thanks jreback for the answer.

Upvotes: 1

Jeff
Jeff

Reputation: 129018

This was a bug fixed in 0.16.1, see here: https://github.com/pydata/pandas/issues/9697

Upvotes: 1

Related Questions