coghmeister

Reputation: 5

Python new column based on NaN in other columns

I'm quite new to Python and this is my first ever question so please be gentle with me!

I have tried out answers to other similar questions but am still quite stuck.

I am using Pandas and I have a dataframe which is a merge from multiple different SQL tables and looks something like this:

Col_1   Col_2   Col_3   Col_4
1       NaN     NaN     NaN
2       Y       NaN     NaN
3       Z       C       S
4       NaN     B       W

I don't care about the values in Col_2, Col_3 and Col_4 (note these can be strings, integers or objects depending on the column).

I just care whether at least one of these columns is populated, so ideally I would like a fifth column like:

Col_1   Col_2   Col_3   Col_4   Col_5
1       NaN     NaN     NaN     0
2       Y       NaN     NaN     1
3       Z       C       S       1
4       NaN     B       W       1

Then I want to drop the columns Col_2 to Col_4.

My initial thought was something like the function below, but this is reducing my dataframe from 50000 rows to 50. I don't want to delete any rows.

def function(row):
    if pd.isnull(row['Col_2']) and pd.isnull(row['Col_3']) and pd.isnull(row['Col_4']):
        return 0
    else:
        return 1

df['Col_5'] = df.apply(function, axis=1)

Any help would be much appreciated.

Upvotes: 0

Views: 4197

Answers (2)

EdChum

Reputation: 394159

Use np.any and pass param axis=1, which tests row-wise. This produces a boolean array which, when converted to int, turns all True values into 1 and False values into 0. This will be much faster than calling apply, which iterates row-wise and is very slow:

In [30]:

df['Col_5'] = np.any(df[df.columns[1:]].notnull(), axis=1).astype(int)
df
Out[30]:
   Col_1 Col_2 Col_3 Col_4  Col_5
0      1   NaN   NaN   NaN      0
1      2     Y   NaN   NaN      1
2      3     Z     C     S      1
3      4   NaN     B     W      1

In [31]:

df = df[['Col_1', 'Col_5']]
df
Out[31]:
   Col_1  Col_5
0      1      0
1      2      1
2      3      1
3      4      1

Here is the output from np.any:

In [34]:

np.any(df[df.columns[1:]].notnull(), axis=1)
Out[34]:
array([False,  True,  True,  True], dtype=bool)

Timings

In [35]:

%timeit df[df.columns[1:]].apply(lambda x: all(x.isnull()) , axis=1).astype(int)
%timeit np.any(df[df.columns[1:]].notnull(), axis=1).astype(int)
100 loops, best of 3: 2.46 ms per loop
1000 loops, best of 3: 1.4 ms per loop

So on your test data, for a df of this size, this method is almost 2x faster than the other answer.

Update

As you are running pandas version 0.12.0, you need to call the top-level notnull function, as that method is not available at the DataFrame level:

np.any(pd.notnull(df[df.columns[1:]]), axis=1).astype(int)

I suggest you upgrade as you'll get lots more features and bug fixes.
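On current pandas versions, the same result can be sketched more directly with notna and drop(columns=...); a minimal, self-contained example using the sample data from the question (column names assumed as shown there):

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question's data
df = pd.DataFrame({
    'Col_1': [1, 2, 3, 4],
    'Col_2': [np.nan, 'Y', 'Z', np.nan],
    'Col_3': [np.nan, np.nan, 'C', 'B'],
    'Col_4': [np.nan, np.nan, 'S', 'W'],
})

# Flag rows where at least one of Col_2..Col_4 is populated
df['Col_5'] = df[['Col_2', 'Col_3', 'Col_4']].notna().any(axis=1).astype(int)

# Drop the columns that are no longer needed
df = df.drop(columns=['Col_2', 'Col_3', 'Col_4'])
print(df['Col_5'].tolist())  # → [0, 1, 1, 1]
```

notna and drop(columns=...) were added after 0.12.0 (0.21.0), which is another reason to upgrade.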

Upvotes: 4

JAB

Reputation: 12801

Using apply with a function:

df['Col_5'] = df[df.columns[1:]].apply(lambda x: int(not all(x.isnull())), axis=1)

which for my money is a bit easier to read. Not sure which is quicker.
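A runnable sketch of this row-wise apply approach on the question's sample data (column names assumed from the question; as the timings in the other answer show, apply is slower on large frames):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col_1': [1, 2, 3, 4],
    'Col_2': [np.nan, 'Y', 'Z', np.nan],
    'Col_3': [np.nan, np.nan, 'C', 'B'],
    'Col_4': [np.nan, np.nan, 'S', 'W'],
})

# Row-wise apply over Col_2..Col_4: 0 when all are null, 1 otherwise
df['Col_5'] = df[df.columns[1:]].apply(
    lambda row: int(not all(row.isnull())), axis=1
)
print(df['Col_5'].tolist())  # → [0, 1, 1, 1]
```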

Upvotes: 0
