Reputation: 641

understanding lambda functions in pandas

I'm trying to solve a problem for a Python course and found that someone has posted solutions to the same problem on GitHub. I'm just trying to understand the solution given there.

I have a pandas dataframe called Top15 with 15 countries, and one of the columns in the dataframe is 'HighRenew'. This column stores the % of renewable energy used in each country. My task is to convert the values in the 'HighRenew' column into a boolean datatype.

If the value for a particular country is at or above the median renewable energy percentage across all 15 countries, then I should encode it as 1; otherwise it should be a 0. The 'HighRenew' column is sliced out as a Series from the dataframe, which is copied below.

Country
China                  True
United States         False
Japan                 False
United Kingdom        False
Russian Federation     True
Canada                 True
Germany                True
India                 False
France                 True
South Korea           False
Italy                  True
Spain                  True
Iran                  False
Australia             False
Brazil                 True
Name: HighRenew, dtype: bool

The GitHub solution is implemented in 3 steps, of which I understand the first 2 but not the last one, where a lambda function is used. Can someone explain how this lambda function works?

median_value = Top15['% Renewable'].median()
Top15['HighRenew'] = Top15['% Renewable']>=median_value
Top15['HighRenew'] = Top15['HighRenew'].apply(lambda x:1 if x else 0)

Upvotes: 3

Answers (3)

Brandon Barney

Reputation: 2392

Instead of using workarounds or lambdas, just use pandas' built-in functionality meant for this problem. The approach is called masking: in essence, we apply a comparison operator to a Series (a column of a df) to get the boolean values:

import pandas as pd
import numpy as np

foo = [{
    'Country': 'Germany',
    'Percent Renew': 100
}, {
    'Country': 'Australia',
    'Percent Renew': 75
}, {
    'Country': 'China',
    'Percent Renew': 25
}, {
    'Country': 'USA',
    'Percent Renew': 5
}]

df = pd.DataFrame(foo, index=pd.RangeIndex(0, len(foo)))

df

#| Country   | Percent Renew |
#| Germany   | 100           |
#| Australia | 75            |
#| China     | 25            |
#| USA       | 5             |

np.mean(df['Percent Renew'])
# 51.25

df['Better Than Average'] = df['Percent Renew'] > np.mean(df['Percent Renew'])

#| Country   | Percent Renew | Better Than Average |
#| Germany   | 100           | True                |
#| Australia | 75            | True                |
#| China     | 25            | False               |
#| USA       | 5             | False               |

The specific reason I propose this over the other solutions is that masking can be used for a host of other purposes as well. I won't get into them here, but once you learn that pandas supports this kind of functionality, it becomes a lot easier to perform other data manipulations in pandas.

EDIT: I read "needing boolean datatype" as needing True/False rather than the encoded 1 and 0. If the encoded version is needed, the astype approach that was proposed will convert the booleans to integer values. For masking purposes, though, the True/False values are what is needed for slicing.
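
As a rough illustration of the slicing use mentioned above, here is a minimal sketch. The frame mirrors the toy data in this answer; the names `mask`, `above_average` and `mask_as_int` are only illustrative and do not come from the original post.

```python
import pandas as pd

# Toy data mirroring the example above (values are illustrative only).
df = pd.DataFrame({
    "Country": ["Germany", "Australia", "China", "USA"],
    "Percent Renew": [100, 75, 25, 5],
})

# A boolean mask: one True/False per row.
mask = df["Percent Renew"] > df["Percent Renew"].mean()

# The mask can be used directly to slice the rows where it is True.
above_average = df[mask]

# If the 1/0 encoding is wanted instead, the same mask can be cast.
mask_as_int = mask.astype(int)

print(above_average)
print(mask_as_int.tolist())
```

The mask has to stay boolean for `df[mask]` to select rows; the integer version is only useful as an encoded column.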

Upvotes: 0

jpp

Reputation: 164773

lambda represents an anonymous (i.e. unnamed) function. If it is used with pd.Series.apply, each element of the series is fed into the lambda function. The result will be another pd.Series with each element run through the lambda.

apply + lambda is just a thinly veiled loop. You should prefer to use vectorised functionality where possible. @jezrael offers such a vectorised solution.
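
To make the "thinly veiled loop" point concrete, here is a small sketch on a made-up three-element Series. The helper name `to_int` is hypothetical; it is just the lambda with a name attached.

```python
import pandas as pd

# A small boolean Series standing in for Top15['HighRenew'] (made-up subset).
s = pd.Series([True, False, True],
              index=["China", "Japan", "Brazil"], name="HighRenew")

# 1) The lambda from the question, applied element by element.
via_apply = s.apply(lambda x: 1 if x else 0)

# 2) The same thing with a named function instead of an anonymous one.
def to_int(x):
    return 1 if x else 0

via_named = s.apply(to_int)

# 3) Roughly what apply does conceptually: an explicit Python-level loop.
via_loop = pd.Series([to_int(x) for x in s], index=s.index, name=s.name)

print(via_apply.tolist(), via_named.tolist(), via_loop.tolist())
```

All three produce the same values; the only difference between the first two is whether the function has a name.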

The equivalent in regular python is below, given a list lst. Here each element of lst is passed through the lambda function and aggregated in a list.

list(map(lambda x: 1 if x else 0, lst))

It is a Pythonic idiom to test for "Truthy" values using if x rather than if x == True, see this answer for more information on what is considered True.
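
A short sketch of how the two tests differ, using an arbitrary list chosen for illustration (it is not from the question):

```python
# An arbitrary mix of values to show what "if x" treats as true.
lst = [True, False, 1, 0, "", "text", None, [], [0]]

# Truthiness test, as in the lambda: empty containers, 0 and None are falsy.
encoded = list(map(lambda x: 1 if x else 0, lst))

# Equality test: only values that compare equal to True (True itself and 1) pass.
strict = [1 if x == True else 0 for x in lst]

print(encoded)  # [1, 0, 1, 0, 0, 1, 0, 0, 1]
print(strict)   # [1, 0, 1, 0, 0, 0, 0, 0, 0]
```

For a Series that already contains only True and False, as in the question, both tests give the same result; the difference only shows up with other kinds of values.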

Upvotes: 6

jezrael

Reputation: 863166

I think apply is a loop under the hood, so it is better to use the vectorized astype, which converts True to 1 and False to 0:

Top15['HighRenew'] = (Top15['% Renewable']>=median_value).astype(int)

lambda x:1 if x else 0

is an anonymous function (lambda function) with a condition: if x is True it returns 1, otherwise it returns 0.

For more information about lambda functions, check these answers.
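
As a quick check that the two approaches agree, here is a minimal sketch on a five-row stand-in for Top15. The percentages are invented for illustration and are not the course data.

```python
import pandas as pd

# Hypothetical stand-in for Top15; the numbers are made up.
Top15 = pd.DataFrame(
    {"% Renewable": [19.75, 11.57, 10.23, 69.65, 17.02]},
    index=pd.Index(["China", "United States", "Japan", "Brazil", "Germany"],
                   name="Country"),
)

median_value = Top15["% Renewable"].median()
is_high = Top15["% Renewable"] >= median_value  # boolean Series

# Vectorized conversion of the boolean Series.
vectorized = is_high.astype(int)

# Element-by-element conversion with the lambda from the question.
with_apply = is_high.apply(lambda x: 1 if x else 0)

print(vectorized.tolist())
print(with_apply.tolist())
```

Both give the same 1/0 values; astype does the conversion in one vectorized step instead of calling a Python function per element.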

Upvotes: 3
