Reputation: 1749
I was wondering what the best functional-programming practice is for writing a pipeline of functions that take pandas dataframes (or any other mutable type) as input.
Here are two ideas, but I hope something better exists :)
idea # 1 - no functional programming but saving memory
def foo(df, param):
    # Mutates the caller's dataframe in place; returns nothing
    df['col'] = df['col'] + param

def pipeline(df):
    foo(df, 1)
    foo(df, 2)
    foo(df, 3)
idea # 2 - more functional programming but wasting memory by doing .copy()
def foo(df, param):
df = df.copy()
df['col'] = df['col'] + param
return df
def pipeline(df):
df1 = foo(df, 1)
df2 = foo(df1, 2)
df3 = foo(df2, 3)
Upvotes: 9
Views: 5098
Reputation: 3504
If your dataframe is 1-dimensional (that is, it is a list of items), then instead of applying multiple 'pipes' you can do everything in one go. For example, suppose we are given a dataframe:
>>> table
     Name  Year of Birth      City
0    Mike           1970  New York
1   Chris           1981     Miami
2  Janine           1975   Seattle
We want to calculate each person's age and check whether they live in Miami. You could apply two pipes, but it can be done with a single lambda:
>>> f = lambda s: pd.Series([2019 - s['Year of Birth'], s['City'] == 'Miami'])
Then apply it
>>> table[['Age', 'Lives in Miami']] = table.apply(f, axis=1)
>>> table
     Name  Year of Birth      City  Age  Lives in Miami
0    Mike           1970  New York   49           False
1   Chris           1981     Miami   38            True
2  Janine           1975   Seattle   44           False
All the work is done in f.
Another example: say you want to check who is over 45 years old. This involves two serial operations (the previous example had two parallel operations). You could use two pipes, but again it can be done by applying a single lambda that does it all:
>>> table
     Name  Year of Birth      City
0    Mike           1970  New York
1   Chris           1981     Miami
2  Janine           1975   Seattle
>>> g = lambda s: pd.Series([2019 - s['Year of Birth'] > 45])
>>> table['Older than 45'] = table.apply(g, axis=1)
>>> table
     Name  Year of Birth      City  Older than 45
0    Mike           1970  New York           True
1   Chris           1981     Miami          False
2  Janine           1975   Seattle          False
If you prefer immutability, then do:
>>> table
     Name  Year of Birth      City
0    Mike           1970  New York
1   Chris           1981     Miami
2  Janine           1975   Seattle
>>> pd.concat([table, table.apply(f, axis=1)], axis=1)
     Name  Year of Birth      City   0      1
0    Mike           1970  New York  49  False
1   Chris           1981     Miami  38   True
2  Janine           1975   Seattle  44  False
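The concat output above ends up with columns named 0 and 1 because f returns an unlabelled Series. A minimal sketch of one way to get named columns while still leaving table untouched, assuming the same example data (the name f_named is made up for this illustration):

```python
import pandas as pd

table = pd.DataFrame({
    "Name": ["Mike", "Chris", "Janine"],
    "Year of Birth": [1970, 1981, 1975],
    "City": ["New York", "Miami", "Seattle"],
})

def f_named(s):
    # Hypothetical variant of f: the labels of the returned Series
    # become the column names of the frame produced by apply.
    return pd.Series({
        "Age": 2019 - s["Year of Birth"],
        "Lives in Miami": s["City"] == "Miami",
    })

# concat builds a new dataframe; table itself is not modified.
result = pd.concat([table, table.apply(f_named, axis=1)], axis=1)
print(result)
```

Here result carries the extra Age and Lives in Miami columns, and table keeps its original three.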
If you don't like lambdas, then define:
def f(s):
    return pd.Series([2019 - s['Year of Birth'],
                      s['City'] == 'Miami'])
All of this only works when the mapping you want to apply needs nothing but a single row at a time. Also, I make no claims about speed; pipes might be faster, but this is very readable.
(And in the end you realize you'd be better off using Haskell.)
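On the speed caveat: for row-independent mappings like these, the same results can often be written as whole-column operations rather than a row-wise apply. This is a sketch of that alternative, not a benchmark, using DataFrame.assign on the same example data; whether it is faster in your case is something to measure.

```python
import pandas as pd

table = pd.DataFrame({
    "Name": ["Mike", "Chris", "Janine"],
    "Year of Birth": [1970, 1981, 1975],
    "City": ["New York", "Miami", "Seattle"],
})

# assign returns a new dataframe and leaves table unchanged.
# Keyword arguments are evaluated in order, so the callable for
# "Older than 45" can read the "Age" column created just before it.
result = table.assign(**{
    "Age": 2019 - table["Year of Birth"],
    "Lives in Miami": table["City"] == "Miami",
    "Older than 45": lambda d: d["Age"] > 45,
})
print(result)
```

The dict unpacking is only needed because the column names contain spaces.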
Upvotes: 4
Reputation: 1435
You can chain function calls operating on the dataframe. Also take a look at DataFrame.pipe in pandas. Something like this, adding in a couple of non-foo operations:
df = (df.pipe(foo, 1)
        .pipe(foo, 2)
        .pipe(foo, 3)
        .drop(columns=['drop', 'these'])
        .assign(NEW_COL=lambda x: x['OLD_COL'] / 10))
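df will be the first argument passed to foo when you use pipe. Note that foo has to return a dataframe for the chain to continue, as in idea #2 of the question; the in-place foo from idea #1 returns None, so the next call in the chain would fail.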
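As a self-contained illustration of the chain above, here is a minimal sketch in which foo is written without mutating its input, using assign instead of an explicit .copy(). The sample data and the name original are made up for the example. This still creates a new dataframe at each step, so it addresses the style question rather than the memory one.

```python
import pandas as pd

def foo(df, param):
    # assign returns a new dataframe, so the input is not modified
    return df.assign(col=df["col"] + param)

def pipeline(df):
    # pipe passes the dataframe as the first argument to foo
    return (df.pipe(foo, 1)
              .pipe(foo, 2)
              .pipe(foo, 3))

original = pd.DataFrame({"col": [0, 10, 20]})  # hypothetical sample data
result = pipeline(original)
print(result)
```

After the call, result holds the incremented values while original is unchanged.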
Upvotes: 7