user1403546

Reputation: 1749

functional programming and python pandas dataframes in pipelines

What is the best functional-programming practice for writing a pipeline of functions that take pandas dataframes (or any other mutable type) as input?

Here are two ideas, but I hope something better exists :)

Idea #1 - not functional, but saves memory (foo mutates df in place)

def foo(df, param):
    df['col'] = df['col'] + param

def pipeline(df):
    foo(df, 1)
    foo(df, 2)
    foo(df, 3)

Idea #2 - more functional, but wastes memory by calling .copy()

def foo(df, param):
    df = df.copy()
    df['col'] = df['col'] + param
    return df

def pipeline(df):
    df1 = foo(df, 1)
    df2 = foo(df1, 2)
    df3 = foo(df2, 3)
    return df3

Upvotes: 9

Views: 5098

Answers (2)

Elmex80s

Reputation: 3504

If your dataframe is 1-dimensional (that is, it is a list of items), then instead of applying multiple 'pipes' you can do everything in one go. For example, suppose we are given a dataframe

>>> table
     Name  Year of Birth      City
0    Mike           1970  New York
1   Chris           1981     Miami
2  Janine           1975   Seattle

We want to calculate each person's age and check whether they live in Miami. You could apply 2 pipes, but it can be done with a single lambda:

>>> f = lambda s: pd.Series([2019 - s['Year of Birth'], s['City'] == 'Miami'])

Then apply it

>>> table[['Age', 'Lives in Miami']] = table.apply(f, axis=1)
>>> table
     Name  Year of Birth      City  Age  Lives in Miami
0    Mike           1970  New York   49           False
1   Chris           1981     Miami   38            True
2  Janine           1975   Seattle   44           False

All the work is done in f.

Another example: say you want to check who is over 45 years old. This involves two serial operations (the previous example had two parallel operations):

  1. Calculate age
  2. Check if age is bigger than 45

You could use 2 pipes but again it can be done by applying a single lambda which does it all:

>>> table
     Name  Year of Birth      City
0    Mike           1970  New York
1   Chris           1981     Miami
2  Janine           1975   Seattle
>>> g = lambda s: pd.Series([2019 - s['Year of Birth'] > 45])
>>> table['Older than 45'] = table.apply(g, axis=1)
>>> table
     Name  Year of Birth      City  Older than 45
0    Mike           1970  New York           True
1   Chris           1981     Miami          False
2  Janine           1975   Seattle          False

If you prefer immutability then do

>>> table
     Name  Year of Birth      City
0    Mike           1970  New York
1   Chris           1981     Miami
2  Janine           1975   Seattle
>>> pd.concat([table, table.apply(f, axis=1)], axis=1)
     Name  Year of Birth      City   0      1
0    Mike           1970  New York  49  False
1   Chris           1981     Miami  38   True
2  Janine           1975   Seattle  44  False
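
The result above has columns named 0 and 1 because f returns an unlabelled Series. A minimal sketch of one way to get named columns while keeping table untouched, assuming f is allowed to label its output through the Series index (the names 'Age' and 'Lives in Miami' are taken from the earlier example; extended is just an illustrative variable name):

```python
import pandas as pd

table = pd.DataFrame({
    "Name": ["Mike", "Chris", "Janine"],
    "Year of Birth": [1970, 1981, 1975],
    "City": ["New York", "Miami", "Seattle"],
})


def f(s):
    # The index of the returned Series becomes the new column names.
    return pd.Series(
        [2019 - s["Year of Birth"], s["City"] == "Miami"],
        index=["Age", "Lives in Miami"],
    )


# concat builds a new dataframe; table itself is not modified.
extended = pd.concat([table, table.apply(f, axis=1)], axis=1)
print(extended)
```

Since nothing is assigned back into table, the same f can be reused on other dataframes with the same columns.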

If you don't like lambdas then define

def f(s):
    return pd.Series([2019 - s['Year of Birth'], 
                      s['City'] == 'Miami'])

This only works when the mapping you want to apply needs nothing but a single row at a time. Also, I make no claims about speed; pipes might be faster, but this is very readable.

(And in the end you realize you'd be better off using Haskell.)
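
If speed does matter, the same two columns can also be computed column-wise instead of row by row. This is only a sketch of that alternative, not a benchmark; it assumes the same sample table and uses DataFrame.assign, which returns a new dataframe (result is an illustrative variable name):

```python
import pandas as pd

table = pd.DataFrame({
    "Name": ["Mike", "Chris", "Janine"],
    "Year of Birth": [1970, 1981, 1975],
    "City": ["New York", "Miami", "Seattle"],
})

# Both expressions operate on whole columns, so no Python function
# is called per row. The dict is unpacked because the column names
# contain spaces and cannot be written as plain keyword arguments.
result = table.assign(**{
    "Age": 2019 - table["Year of Birth"],
    "Lives in Miami": table["City"] == "Miami",
})
print(result)
```

Like the concat version, this leaves table unchanged.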

Upvotes: 4

T Burgis

Reputation: 1435

You can chain function calls operating on the dataframe. Also take a look at DataFrame.pipe in pandas. Something like this, adding in a couple of non-foo operations:

df = (df.pipe(foo, 1)
        .pipe(foo, 2)
        .pipe(foo, 3)
        .drop(columns=['drop', 'these'])
        .assign(NEW_COL=lambda x: x['OLD_COL'] / 10))

When you use pipe, df is passed to foo as its first argument, so foo must return a dataframe for the chain to continue (as in idea #2 from the question).
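
A minimal, self-contained sketch of such a chain, assuming a foo that builds its result with DataFrame.assign and so returns a new dataframe instead of mutating its input (the sample data and the names original and result are made up for illustration):

```python
import pandas as pd


def foo(df, param):
    # assign returns a new dataframe; the caller's df is not modified.
    return df.assign(col=df["col"] + param)


def pipeline(df):
    # Each pipe call passes the running dataframe as foo's first argument.
    return df.pipe(foo, 1).pipe(foo, 2).pipe(foo, 3)


original = pd.DataFrame({"col": [0, 10, 20]})
result = pipeline(original)

print(result["col"].tolist())    # [6, 16, 26]
print(original["col"].tolist())  # [0, 10, 20]
```

Each step still allocates a new dataframe, so this trades memory for the guarantee that the caller's data is never changed.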

Upvotes: 7
