Reputation: 2475
I have a Pandas Dataframe with hierarchically indexed columns. Eg:
cols = pd.MultiIndex.from_tuples([('syn', 'A'), ('mis', 'A'), ('non', 'A'), ('syn', 'C'), ('mis', 'C'), ('non', 'C'),
('syn', 'T'), ('mis', 'T'), ('non', 'T'), ('syn', 'G'), ('mis', 'G'), ('non', 'G')])
sample = pd.DataFrame(np.random.randint(1, 10, (4,12)), columns=cols, index=['A', 'C', 'G', 'T'])
sample.head()
Giving:
syn mis non syn mis non syn mis non syn mis non
A A A C C C T T T G G G
A 7 3 9 5 4 8 6 4 3 6 4 2
C 5 2 2 4 9 6 3 3 9 6 2 1
G 2 4 5 2 8 3 8 3 6 1 2 4
T 9 4 8 9 8 5 8 8 2 2 6 5
However, when I attempt to normalize on level 1 of the columns as follows:
sample.groupby(axis=1, level=1).transform(lambda z: z.div(z.sum(axis=1), axis=0))
I get an error:
ValueError: Length mismatch: Expected axis has 4 elements, new values have 12 elements
Weirdly, it works fine if I use up to 3 of the values of level 1 of axis 1, for example:
ind = sample.columns.get_level_values(1).isin(['A', 'C', 'G'])
subsample = sample.loc[:, ind]
subsample.head()
which just takes the first 3 sets of values from the sample:
syn mis non syn mis non syn mis non
A A A C C C G G G
A 7 3 9 5 4 8 6 4 2
C 5 2 2 4 9 6 6 2 1
G 2 4 5 2 8 3 1 2 4
T 9 4 8 9 8 5 2 6 5
Then:
subsample.groupby(axis=1, level=1).transform(lambda z: z.div(z.sum(axis=1), axis=0))
correctly returns:
syn mis non syn mis non syn mis non
A A A C C C G G G
A 0.37 0.16 0.47 0.29 0.24 0.47 0.50 0.33 0.17
C 0.56 0.22 0.22 0.21 0.47 0.32 0.67 0.22 0.11
G 0.18 0.36 0.45 0.15 0.62 0.23 0.14 0.29 0.57
T 0.43 0.19 0.38 0.41 0.36 0.23 0.15 0.46 0.38
Any idea why the latter works and the former doesn't? I'm using Pandas version 0.16.1
Upvotes: 0
Views: 534
Reputation: 2475
Apparently: sample.groupby(axis=1, level=0).apply(lambda z: z.div(z.sum(axis=1), axis=0))
works as intended returning:
syn mis non syn mis non syn mis non syn mis non
A A A C C C T T T G G G
A 0.24 0.38 0.38 0.36 0.14 0.50 0.19 0.31 0.50 0.50 0.17 0.33
C 0.22 0.44 0.33 0.31 0.12 0.56 0.25 0.58 0.17 0.08 0.17 0.75
G 0.28 0.36 0.36 0.80 0.10 0.10 0.38 0.08 0.54 0.27 0.64 0.09
T 0.24 0.41 0.35 0.60 0.33 0.07 0.43 0.43 0.14 0.55 0.27 0.18
Although I still don't understand why .transform didn't work as well. Thanks jreback for the answer.
Upvotes: 1
Reputation: 129018
This was a bug fixed in 0.16.1, see here: https://github.com/pydata/pandas/issues/9697
Upvotes: 1