functional programming and python pandas dataframes in pipelines

Question

What is the best practice, from a functional-programming point of view, for writing a pipeline of functions that take pandas dataframes (or any other mutable type) as input?

Here are two ideas, but I hope something better exists :)

idea # 1 - no functional programming but saving memory

def foo(df, param):
    df['col'] = df['col'] + param

def pipeline(df):
    foo(df, 1)
    foo(df, 2)
    foo(df, 3)

idea # 2 - more functional programming but wasting memory by doing .copy()

def foo(df, param):
    df = df.copy()
    df['col'] = df['col'] + param
    return df

def pipeline(df):
    df1 = foo(df, 1)
    df2 = foo(df1, 2)
    df3 = foo(df2, 3)
    return df3  # return the final frame; the input df is left untouched

Elmex80s · Accepted Answer

If your dataframe is 1-dimensional (that is, it is essentially a list of items, one per row), then instead of applying multiple 'pipes' you can do everything in one go. For example, suppose we are given this dataframe:

>>> table
     Name  Year of Birth      City
0    Mike           1970  New York
1   Chris           1981     Miami
2  Janine           1975   Seattle

We want to calculate each person's age and check whether they live in Miami. You could apply two pipes, but it can be done with a single lambda:

>>> f = lambda s: pd.Series([2019 - s['Year of Birth'], s['City'] == 'Miami'])

Then apply it:

>>> table[['Age', 'Lives in Miami']] = table.apply(f, axis=1)
>>> table
     Name  Year of Birth      City  Age  Lives in Miami
0    Mike           1970  New York   49           False
1   Chris           1981     Miami   38            True
2  Janine           1975   Seattle   44           False

All the work is done in f.

Another example: say you want to check who is over 45 years old. This consists of two serial operations (the previous example had two parallel operations):

1. Calculate the age
2. Check whether the age is greater than 45

You could use two pipes, but again it can be done by applying a single lambda that does it all:

>>> table
     Name  Year of Birth      City
0    Mike           1970  New York
1   Chris           1981     Miami
2  Janine           1975   Seattle
>>> g = lambda s: pd.Series([2019 - s['Year of Birth'] > 45])
>>> table['Older than 45'] = table.apply(g, axis=1)
>>> table
     Name  Year of Birth      City  Older than 45
0    Mike           1970  New York           True
1   Chris           1981     Miami          False
2  Janine           1975   Seattle          False
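
The two serial steps can also be kept as separate small functions and composed into one row-level function before a single apply. This is only a minimal sketch, assuming the same table as above; `compose`, `age` and `older_than_45` are hypothetical names introduced here for illustration, and 2019 is hard-coded as in the other examples.

```python
from functools import reduce

import pandas as pd


def compose(*funcs):
    """Compose functions left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), funcs, x)


def age(row):
    # Step 1: derive the age from a row (reference year hard-coded to 2019).
    return 2019 - row['Year of Birth']


def older_than_45(years):
    # Step 2: test the derived age.
    return years > 45


table = pd.DataFrame({
    'Name': ['Mike', 'Chris', 'Janine'],
    'Year of Birth': [1970, 1981, 1975],
    'City': ['New York', 'Miami', 'Seattle'],
})

# One pass over the rows; the composed function returns a scalar per row,
# so apply yields a Series and `table` itself is not modified.
is_older = table.apply(compose(age, older_than_45), axis=1)
```

The result can then be attached with `table['Older than 45'] = is_older`, or kept separate if the original table should stay unchanged.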

If you prefer immutability, build a new dataframe instead of assigning to the existing one:

>>> table
     Name  Year of Birth      City
0    Mike           1970  New York
1   Chris           1981     Miami
2  Janine           1975   Seattle 
>>> pd.concat([table, table.apply(f, axis=1)], axis=1)
     Name  Year of Birth      City   0      1
0    Mike           1970  New York  49  False
1   Chris           1981     Miami  38   True
2  Janine           1975   Seattle  44  False
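
The concatenated frame above ends up with columns labelled 0 and 1. One way to get meaningful names while still leaving the original table untouched is to return a labelled Series from the row function. A minimal sketch, assuming the same table; `f_named` and `extended` are hypothetical names used only for this illustration.

```python
import pandas as pd

table = pd.DataFrame({
    'Name': ['Mike', 'Chris', 'Janine'],
    'Year of Birth': [1970, 1981, 1975],
    'City': ['New York', 'Miami', 'Seattle'],
})


def f_named(s):
    # The labels of the returned Series become the names of the new columns.
    return pd.Series({
        'Age': 2019 - s['Year of Birth'],
        'Lives in Miami': s['City'] == 'Miami',
    })


# concat builds a new frame; `table` itself is left unchanged.
extended = pd.concat([table, table.apply(f_named, axis=1)], axis=1)
```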

If you don't like lambdas, define f as a regular function:

def f(s):
    return pd.Series([2019 - s['Year of Birth'], 
                      s['City'] == 'Miami'])

This only works when the mapping you want to apply needs nothing but a single row at a time. Also, I make no claims about speed; pipes might be faster, but this approach is very readable.

(And in the end you realize you'd be better off using Haskell.)
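
For comparison, the 'pipes' mentioned above could be written with `DataFrame.assign` and `DataFrame.pipe`, which operate on whole columns and return new frames rather than modifying the input. This is only a sketch under the same assumptions as the examples above (the sample table, 2019 as the reference year); `add_age` and `add_lives_in_miami` are hypothetical helper names, and no speed measurements are claimed here.

```python
import pandas as pd


def add_age(df, year=2019):
    # assign returns a new DataFrame; the caller's frame is not modified.
    return df.assign(Age=year - df['Year of Birth'])


def add_lives_in_miami(df):
    # Column names containing spaces can be passed via dict unpacking.
    return df.assign(**{'Lives in Miami': df['City'] == 'Miami'})


table = pd.DataFrame({
    'Name': ['Mike', 'Chris', 'Janine'],
    'Year of Birth': [1970, 1981, 1975],
    'City': ['New York', 'Miami', 'Seattle'],
})

# Each pipe step receives the frame produced by the previous step.
result = table.pipe(add_age).pipe(add_lives_in_miami)
```

Since each `assign` call produces a new frame, this is closer in spirit to idea # 2 from the question than to idea # 1.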
