IcemanBerlin
IcemanBerlin

Reputation: 3427

Using an if statement in a dataframe with lambda functions

I am trying to add a new column to a dataframe based on an if statement depending on the values of two columns. i.e. if column x == None then column y else column x

below is the script I have written but doesn't work. any ideas?

dfCurrentReportResults['Retention'] =  dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x)

Also I got this error message: AttributeError: ("'Series' object has no attribute 'Retention_x'", u'occurred at index BUSINESSUNIT_NAME')

fyi: BUSINESSUNIT_NAME is the first column name

Additional Info:

My data printed out looks like this and I want to add a 3rd column to take a value if there is one else keep NaN.

   Retention_x  Retention_y
0            1          NaN
1          NaN     0.672183
2          NaN     1.035613
3          NaN     0.771469
4          NaN     0.916667
5          NaN          NaN
6          NaN          NaN
7          NaN          NaN
8          NaN          NaN
9          NaN          NaN

UPDATE: In the end I was having issues referencing the Null or is Null in my dataframe the final line of code I used also including the axis = 1 answered my question.

 dfCurrentReportResults['RetentionLambda'] = dfCurrentReportResults.apply(lambda x : x['Retention_y'] if pd.isnull(x['Retention_x']) else x['Retention_x'], axis = 1)

Thanks @EdChum, @strim099 and @aus_lacy for all your input. As my data set gets larger I may switch to the np.where option if I notice performance issues.

Upvotes: 0

Views: 8422

Answers (2)

EdChum
EdChum

Reputation: 393933

Just use np.where:

dfCurrentReportResults['Retention'] =  np.where(df.Retention_x == None, df.Retention_y, else df.Retention_x)

This uses the test condition, the first param and sets the value to df.Retention_y else df.Retention_x

also avoid using apply where possible as this is just going to loop over the values, np.where is a vectorised method and will scale much better.

UPDATE

OK no need to use np.where just use the following simpler syntax:

dfCurrentReportResults['Retention'] =  df.Retention_y.where(df.Retention_x == None, df.Retention_x)

Further update

dfCurrentReportResults['Retention'] =  df.Retention_y.where(df.Retention_x.isnull(), df.Retention_x)

Upvotes: 2

Jason Strimpel
Jason Strimpel

Reputation: 15466

You'r lambda is operating on the 0 axis which is columnwise. Simply add axis=1 to the apply arg list. This is clearly documented.

In [1]: import pandas

In [2]: dfCurrentReportResults = pandas.DataFrame([['a','b'],['c','d'],['e','f'],['g','h'],['i','j']], columns=['Retention_y', 'Retention_x'])

In [3]: dfCurrentReportResults['Retention_x'][1] = None

In [4]: dfCurrentReportResults['Retention_x'][3] = None

In [5]: dfCurrentReportResults
Out[5]:
  Retention_y Retention_x
0           a           b
1           c        None
2           e           f
3           g        None
4           i           j

In [6]: dfCurrentReportResults['Retention'] =  dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x, axis=1)

In [7]: dfCurrentReportResults
Out[7]:
  Retention_y Retention_x Retention
0           a           b         b
1           c        None         c
2           e           f         f
3           g        None         g
4           i           j         j

Upvotes: 4

Related Questions