Giuseppe La Gualano

Reputation: 1720

How to sum duplicate columns in dataframe and return nan if at least one value is nan

I have a dataframe with duplicate columns (number not known a priori) like this example:

   a    a  a  b  b
0  1  1.0  1  1  1
1  1  NaN  1  1  1

I need to aggregate the duplicate columns by summing their values row by row, returning NaN whenever at least one of the duplicate columns holds NaN in that row.

I have tried this code:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,1,1,1,1], [1,np.nan,1,1,1]], columns=['a','a','a','b','b'])
df = df.groupby(axis=1, level=0).sum()

The result I get is as follows, but it does not return NaN in the second row of column 'a'.

   a  b
0  3  2
1  2  2

In the documentation of pandas.DataFrame.sum there is a skipna parameter that would suit my case, but I am using pandas.core.groupby.GroupBy.sum, which does not accept it. It does accept min_count, which is close to what I want, but the required count is not known in advance and differs for each set of duplicate columns.

For example, min_count=3 solves the problem for column 'a', but obviously returns NaN for the whole of column 'b', which has only two duplicates.
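A quick sketch of why a single min_count cannot work here, using the example frame from the question (the df.T.groupby form is used instead of the axis=1 keyword, which newer pandas versions deprecate; the two are equivalent):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1], [1, np.nan, 1, 1, 1]],
                  columns=['a', 'a', 'a', 'b', 'b'])

# min_count=3 demands at least 3 non-NaN values per group and row:
# right for the three 'a' columns, too strict for the two 'b' columns.
out = df.T.groupby(level=0).sum(min_count=3).T
# 'a' is [3.0, NaN] as desired, but 'b' becomes all NaN.
```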


The result I want to achieve is:

     a  b
0    3  2
1  NaN  2

Upvotes: 1

Views: 100

Answers (2)

PaulS

Reputation: 25383

Another possible solution:

cols, ldf = df.columns.unique(), len(df)

# Python's built-in sum propagates NaN, unlike pandas' sum.
# The reshape target is (rows, unique columns), matching the
# comprehension's row-major order.
pd.DataFrame(
    np.reshape([sum(df.loc[i, x]) for i in range(ldf) for x in cols],
               (ldf, len(cols))),
    columns=cols)

Output:

    a    b
0  3.0  2.0
1  NaN  2.0

Upvotes: 0

mozway

Reputation: 262114

One workaround might be to use apply to get the DataFrame.sum:

df.groupby(level=0, axis=1).apply(lambda x: x.sum(axis=1, skipna=False))

Output:

     a    b
0  3.0  2.0
1  NaN  2.0
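If the apply round-trip feels heavy, a vectorized sketch of the same idea (not from the original answer; df.T.groupby is used to sidestep the deprecated axis=1 keyword) is to sum the groups as usual and then mask every cell whose group contained a NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1], [1, np.nan, 1, 1, 1]],
                  columns=['a', 'a', 'a', 'b', 'b'])

# Plain group sums (NaN skipped), then a boolean frame marking which
# (row, group) cells had at least one NaN among the duplicate columns.
sums = df.T.groupby(level=0).sum().T
has_nan = df.isna().T.groupby(level=0).any().T

# mask replaces every True position with NaN.
result = sums.mask(has_nan)
```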

Upvotes: 2
