helloB

Reputation: 3582

Pandas: take whichever column is not NaN

I am working with a fairly messy data set that has been assembled from individual csv files with slightly different column names. It would be too onerous to rename the columns in the csv files, partly because I am still discovering all the variations, so instead I want to determine, for a set of columns in a given row, which field is not NaN and carry that value forward to a new column. Is there a way to do that?

Case in point. Let's say I have a data frame that looks like this:

Index   A     B
1       15    NaN
2       NaN   11
3       NaN   99
4       NaN   NaN
5       12    14

Let's say my desired output from this is to create a new column C such that my data frame will look like the following:

Index   A     B       C
1       15    NaN     15
2       NaN   11      11
3       NaN   99      99
4       NaN   NaN     NaN
5       12    14      12 (so giving priority to A over B)

How can I accomplish this?

Upvotes: 11

Views: 3575

Answers (5)

user18243nanana

Reputation: 139

Or you could use df.apply to give priority to column A.

def func1(row):
    A = row['A']
    B = row['B']
    # NaN never compares equal to anything (A == float('nan') is always
    # False), so test for missing values with pd.isna instead.
    if pd.isna(A):
        if pd.isna(B):
            y = float('nan')
        else:
            y = B
    else:
        y = A
    return y

df['C'] = df.apply(func1, axis=1)
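As a self-contained check (repeating the function so the snippet runs on its own), the same priority rule applied to the frame from the question:

```python
import pandas as pd

def func1(row):
    # NaN never compares equal to itself, so test with pd.isna.
    if pd.isna(row['A']):
        return row['B']  # stays NaN when B is also missing
    return row['A']

df = pd.DataFrame({'A': [15, None, None, None, 12],
                   'B': [None, 11, 99, None, 14]},
                  index=[1, 2, 3, 4, 5])
df['C'] = df.apply(func1, axis=1)
print(df['C'].tolist())  # [15.0, 11.0, 99.0, nan, 12.0]
```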

Upvotes: 0

Merlin

Reputation: 25639

Try this (this method allows the flexibility of giving preference to columns without relying on column order), using @Alexander's setup:

df["D"] = df["B"]
df["D"] = df['D'].fillna(df['A'].fillna(df['B'].fillna(df['C'])))

      A     B     C     D
0  15.0   NaN  10.0  15.0
1   NaN  11.0   NaN  11.0
2   NaN  99.0  10.0  99.0
3   NaN   NaN  10.0  10.0
4  12.0  14.0  10.0  14.0
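The two assignments above collapse to a single fillna chain with the same B > A > C precedence; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': [15, None, None, None, 12],
                   'B': [None, 11, 99, None, 14],
                   'C': [10, None, 10, 10, 10]})

# Start from B, fill its gaps from A, then from C.
df['D'] = df['B'].fillna(df['A']).fillna(df['C'])
print(df['D'].tolist())  # [15.0, 11.0, 99.0, 10.0, 14.0]
```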

Upvotes: 3

Kartik

Reputation: 8683

pandas.Series.update, called on the new column. Note that each successive column overwrites the previous one, so this gives precedence to the last column rather than the first:

import numpy as np

df['updated'] = np.nan
for col in df.columns:
    df['updated'].update(df[col])
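A sketch of the same idea on the question's two columns, building the Series separately (in-place update through df['updated'] may not propagate under pandas copy-on-write); here B wins over A because it is applied last:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [15, None, None, None, 12],
                   'B': [None, 11, 99, None, 14]})

# Seed with NaN, then let each column's non-NaN values overwrite in turn.
updated = pd.Series(np.nan, index=df.index)
for col in df.columns:
    updated.update(df[col])
df['updated'] = updated

print(df['updated'].tolist())  # [15.0, 11.0, 99.0, nan, 14.0]
```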

Upvotes: 1

Alexander

Reputation: 109546

For a dataframe with an arbitrary number of columns, you can back fill the rows (.bfill(axis=1)) and take the first column (.iloc[:, 0]):

df = pd.DataFrame({
    'A': [15, None, None, None, 12],
    'B': [None, 11, 99, None, 14],
    'C': [10, None, 10, 10, 10]})

df['D'] = df.bfill(axis=1).iloc[:, 0]

>>> df
    A   B   C   D
0  15 NaN  10  15
1 NaN  11 NaN  11
2 NaN  99  10  99
3 NaN NaN  10  10
4  12  14  10  12

Upvotes: 7

maxymoo

Reputation: 36545

If you just have 2 columns, the cleanest way is to use where: s.where(cond, other) keeps the values of s where cond is True and takes values from other where it is False (for some reason it took me a while to wrap my head around this).

In [2]: df.A.where(df.A.notnull(),df.B)
Out[2]:
0    15.0
1    11.0
2    99.0
3     NaN
4    12.0
Name: A, dtype: float64

If you have more than two columns, it might be simpler to use max or min; these ignore null values, but you'll lose the column precedence you want:

In [3]: df.max(axis=1)
Out[3]:
0    15.0
1    11.0
2    99.0
3     NaN
4    14.0
dtype: float64
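If column precedence matters with more than two columns, one alternative not shown above is chaining combine_first, which keeps the caller's values and only fills its NaNs from the argument; a sketch using @Alexander's setup:

```python
import pandas as pd

df = pd.DataFrame({'A': [15, None, None, None, 12],
                   'B': [None, 11, 99, None, 14],
                   'C': [10, None, 10, 10, 10]})

# combine_first preserves A > B > C precedence when chained left to right.
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
print(df['D'].tolist())  # [15.0, 11.0, 99.0, 10.0, 12.0]
```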

Upvotes: 6
