Reputation: 641
I'm trying to solve a problem for a Python course and found that someone has published solutions to the same problem on GitHub. I'm just trying to understand the solution given there.
I have a pandas DataFrame called Top15 with 15 countries. One of its columns, '% Renewable', stores the percentage of renewable energy used in each country. My task is to build a 'HighRenew' column from it and encode the values as integers rather than booleans.
If the value for a particular country is higher than the median renewable energy percentage across all 15 countries, I should encode it as 1; otherwise it should be 0. The 'HighRenew' column, sliced out of the DataFrame as a Series while it is still boolean, is copied below.
Country
China True
United States False
Japan False
United Kingdom False
Russian Federation True
Canada True
Germany True
India False
France True
South Korea False
Italy True
Spain True
Iran False
Australia False
Brazil True
Name: HighRenew, dtype: bool
The GitHub solution is implemented in 3 steps. I understand the first 2 but not the last one, where a lambda function is used. Can someone explain how this lambda function works?
median_value = Top15['% Renewable'].median()
Top15['HighRenew'] = Top15['% Renewable']>=median_value
Top15['HighRenew'] = Top15['HighRenew'].apply(lambda x:1 if x else 0)
Upvotes: 3
Views: 16804
Reputation: 2392
Instead of using workarounds or lambdas, just use pandas' built-in functionality meant for this problem. The approach is called masking: in essence, we apply a comparison operator to a Series (a column of a DataFrame) to get the boolean values:
import pandas as pd
import numpy as np
foo = [{
'Country': 'Germany',
'Percent Renew': 100
}, {
'Country': 'Australia',
'Percent Renew': 75
}, {
'Country': 'China',
'Percent Renew': 25
}, {
'Country': 'USA',
'Percent Renew': 5
}]
df = pd.DataFrame(foo, index=pd.RangeIndex(0, len(foo)))
df
#| Country | Percent Renew |
#| Germany | 100 |
#| Australia | 75 |
#| China | 25 |
#| USA | 5 |
np.mean(df['Percent Renew'])
# 51.25
df['Better Than Average'] = df['Percent Renew'] > np.mean(df['Percent Renew'])
#| Country | Percent Renew | Better Than Average |
#| Germany | 100 | True
#| Australia | 75 | True
#| China | 25 | False
#| USA | 5 | False
The specific reason I propose this over the other solutions is that masking can be used for a host of other purposes as well. I won't get into them here, but once you learn that pandas supports this kind of functionality, it becomes a lot easier to perform other data manipulations in pandas.
EDIT: I read "needing boolean datatype" as needing True / False and not as needing the encoded version 1 and 0, in which case the astype that was proposed will suffice to convert the booleans to integer values. For masking purposes, though, True / False is needed for slicing.
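As a rough sketch of both uses mentioned in the edit above, the same boolean mask can slice the frame and also be encoded as 1 and 0 with astype. The DataFrame below simply mirrors the toy data from this answer, and the names mask and above_average are made up for illustration.

```python
import pandas as pd

# Toy data mirroring the example above (illustrative values only)
df = pd.DataFrame({
    "Country": ["Germany", "Australia", "China", "USA"],
    "Percent Renew": [100, 75, 25, 5],
})

# Boolean mask: True where the value is above the column mean (51.25)
mask = df["Percent Renew"] > df["Percent Renew"].mean()

# Slicing: indexing with the mask keeps only the rows where it is True
above_average = df[mask]

# Encoding: the same mask converted to integers, True -> 1 and False -> 0
df["Better Than Average"] = mask.astype(int)

print(above_average)
print(df)
```

Slicing needs the boolean form; the integer form is only for the final encoded column.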
Upvotes: 0
Reputation: 164773
lambda represents an anonymous (i.e. unnamed) function. If it is used with pd.Series.apply, each element of the series is fed into the lambda function. The result will be another pd.Series with each element run through the lambda.
apply + lambda is just a thinly veiled loop. You should prefer to use vectorised functionality where possible. @jezrael offers such a vectorised solution.
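To make the comparison concrete, here is a small sketch showing that the apply + lambda version and the vectorised astype version produce the same values. The Series flags is a made-up stand-in for the boolean 'HighRenew' column, not the asker's actual data.

```python
import pandas as pd

# Hypothetical boolean column standing in for Top15['HighRenew']
flags = pd.Series([True, False, True], name="HighRenew")

# Element-by-element: the lambda is called once per value
looped = flags.apply(lambda x: 1 if x else 0)

# Vectorised: the whole column is converted in one operation
vectorised = flags.astype(int)

print(looped.tolist())
print(vectorised.tolist())
```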
The equivalent in regular Python is below, given a list lst. Here each element of lst is passed through the lambda function and the results are collected in a list.
list(map(lambda x: 1 if x else 0, lst))
It is a Pythonic idiom to test for "truthy" values using if x rather than if x == True; see this answer for more information on what is considered True.
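A quick way to see the truthiness rule in action is to run the same lambda over a mix of values. The list values below is purely illustrative; empty containers, 0, the empty string and None count as false, and everything else here counts as true.

```python
# Illustrative mix of truthy and falsy values
values = [True, False, 1, 0, "text", "", [0], [], None]

# Same lambda as in the question: 1 for truthy, 0 for falsy
encoded = list(map(lambda x: 1 if x else 0, values))

print(encoded)  # [1, 0, 1, 0, 1, 0, 1, 0, 0]
```

Note that [0] is truthy because the list is non-empty, even though its only element is falsy.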
Upvotes: 6
Reputation: 863166
I think apply is a loop under the hood; it is better to use the vectorized astype, which converts True to 1 and False to 0:
Top15['HighRenew'] = (Top15['% Renewable']>=median_value).astype(int)
lambda x: 1 if x else 0 is an anonymous function (a lambda function) with a condition: if x is True it returns 1, otherwise it returns 0.
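If it helps, the lambda can be read as shorthand for an ordinary named function. The sketch below writes both side by side; the names to_int and to_int_lambda are invented for this illustration.

```python
# Named function, written out in full
def to_int(x):
    if x:
        return 1
    else:
        return 0

# The same logic as an anonymous function, bound to a name for comparison
to_int_lambda = lambda x: 1 if x else 0

for value in (True, False):
    print(value, to_int(value), to_int_lambda(value))
```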
For more information about lambda functions, see these answers.
Upvotes: 3