Lzkatz

Reputation: 173

Apply a set of functions to a dataframe

I'm trying to figure out the best/most efficient way to functionalize dataframe transformations based on some sort of user "config".

The idea behind the config is to have a default set of transformations, then allow the user to customize additional transformations and add them to the dictionary. Then they'd all get applied at "runtime".

I'd like to have some sort of dataframe

    'a' 'b'
0    1   2
1    1   0
2    3   1

and some sort of dictionary of functions to apply, ideally one that accepts both unary and non-unary functions.

A setup that can do both something like

df['bool'] = df['a'] > 1

and

df['sumval'] = df['a'] + df['b']

Right now, I have something like

config = {'bool': (lambda x: x['a'] > 1),
          'sumval': (lambda x: x['a'] + x['b'])}

for k, f in config.items():
    df[k] = df.apply(f, axis=1)

However, I realize this is quite inefficient, especially as the number of functions grows: apply loops over every row once per function, and it loses the benefit of the vectorized (broadcast) operations that expressions like the boolean comparison above get for free.

Any thoughts?

Upvotes: 1

Views: 391

Answers (2)

piRSquared

Reputation: 294338

You can keep the formulas in config as a list of strings, then join them with '\n' and pass the result to the dataframe's eval method.

config = ['bool = a > 1', 'sumVal = a + b']

df.eval('\n'.join(config), inplace=False)

   a  b   bool  sumVal
0  1  2  False       3
1  1  0  False       1
2  3  1   True       4

You can see that

'\n'.join(config)

evaluates to

'bool = a > 1\nsumVal = a + b'

Or similarly with a dictionary

config = {'bool': 'a > 1', 'sumVal': 'a + b'}

j, f = '\n'.join, '{} = {}'.format

df.eval(j([f(k, v) for k, v in config.items()]), inplace=False)

   a  b   bool  sumVal
0  1  2  False       3
1  1  0  False       1
2  3  1   True       4

In this case

j([f(k, v) for k, v in config.items()])

also evaluates to

'bool = a > 1\nsumVal = a + b'
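
For reference, here is a minimal self-contained sketch of the dictionary variant. The helper name build_expr is hypothetical (it is not part of pandas), and the sample frame is the one from the question.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({'a': [1, 1, 3], 'b': [2, 0, 1]})

# Column name -> expression string, as in the dictionary above.
config = {'bool': 'a > 1', 'sumVal': 'a + b'}


def build_expr(config):
    """Hypothetical helper: join 'name = expression' lines for DataFrame.eval."""
    return '\n'.join('{} = {}'.format(k, v) for k, v in config.items())


# inplace=False returns a new frame and leaves df itself untouched.
result = df.eval(build_expr(config), inplace=False)
print(result)
```

Because the formulas are plain strings, a config like this could also be stored in a text or JSON file, which may suit the "user config" idea in the question.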

Upvotes: 2

cs95

Reputation: 402583

Why not just ditch the apply and pass the entire dataframe to the lambda?

for k, f in config.items():
    df[k] = f(df)

print(df)

   a  b  sumval   bool
0  1  2       3  False
1  1  0       1  False
2  3  1       4   True

Each lambda then works on whole columns at once, so pandas' vectorized operations do the magic.
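
A self-contained sketch of this approach, under the assumption that the defaults and the user's additions live in two separate dictionaries (default_config and user_config are illustrative names, not something from the question's code):

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({'a': [1, 1, 3], 'b': [2, 0, 1]})

# Hypothetical split into default and user-supplied transformations.
default_config = {'bool': lambda d: d['a'] > 1}
user_config = {'sumval': lambda d: d['a'] + d['b']}

# Later entries win on name clashes, so user entries override defaults.
config = {**default_config, **user_config}

for name, func in config.items():
    # Each function receives the whole frame and returns a full column,
    # so the work is vectorized instead of done row by row.
    df[name] = func(df)

# Non-mutating alternative: DataFrame.assign calls each callable
# with the frame and returns a new frame with the added columns.
alt = df[['a', 'b']].assign(**config)

print(df)
```

The loop mutates df in place, while the assign call leaves its input alone; which one fits better likely depends on whether the original frame is needed afterwards.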

Upvotes: 2
