user3376169

Reputation: 429

How to pass more than one parameter to the map function in pandas

I have the following DataFrame:

mn = pd.DataFrame({'fld1': [2.23, 4.45, 7.87, 9.02, 8.85, 3.32, 5.55],
                   'fld2': [125000, 350000, 700000, 800000, 200000, 600000, 500000],
                   'lType': ['typ1', 'typ2', 'typ3', 'typ1', 'typ3', 'typ1', 'typ2'],
                   'counter': [100, 200, 300, 400, 500, 600, 700]})

Mapping functions:

def getTag(rangeAttribute):
    sliceDef = {'tag1': [1, 4], 'tag2': [4, 6], 'tag3': [6, 9],
                'tag4': [9, 99]}
    for sl in sliceDef.keys():
        bounds = sliceDef[sl]
        if ((float(rangeAttribute) >= float(bounds[0]))
            and (float(rangeAttribute) <= float(bounds[1]))):
            return sl


def getTag1(rangeAttribute):
    sliceDef = {'100-150': [100000, 150000],
                '150-650': [150000, 650000],
                '650-5M': [650000, 5000000]}
    for sl in sliceDef.keys():
        bounds = sliceDef[sl]
        if ((float(rangeAttribute) >= float(bounds[0]))
            and (float(rangeAttribute) <= float(bounds[1]))):
            return sl

I want to compute sums based on the tags for fld1 and fld2. At present I have to write a separate function with hardcoded values for each field, because map only takes one argument. Is there another function, besides map, that can also take sliceDef as an input parameter?

mn.groupby([mn['fld1'].map(getTag), mn['fld2'].map(getTag1), 'lType']).sum()

Upvotes: 4

Views: 4753

Answers (1)

unutbu

Reputation: 880329

Instead of using map, you could use pd.cut (Thanks to DSM and Jeff for pointing this out):

import numpy as np
import pandas as pd

mn = pd.DataFrame(
    {'fld1': [2.23, 4.45, 7.87, 9.02, 8.85, 3.32, 5.55],
     'fld2': [125000, 350000, 700000, 800000, 200000, 600000, 500000],
     'lType': ['typ1', 'typ2', 'typ3', 'typ1', 'typ3', 'typ1', 'typ2'],
     'counter': [100, 200, 300, 400, 500, 600, 700]})

result = mn.groupby(
    [pd.cut(mn['fld1'], [1,4,6,9,99], labels=['tag1', 'tag2', 'tag3', 'tag4']),
     pd.cut(mn['fld2'], [100000, 150000, 650000, 5000000],
            labels=['100-150', '150-650', '650-5M']),
     'lType']).sum()

print(result)

yields

                    counter   fld1    fld2
             lType                        
tag1 100-150 typ1       100   2.23  125000
     150-650 typ1       600   3.32  600000
tag2 150-650 typ2       900  10.00  850000
tag3 150-650 typ3       500   8.85  200000
     650-5M  typ3       300   7.87  700000
tag4 650-5M  typ1       400   9.02  800000

This will be quicker than calling getTag or getTag1 once for every value in the series. Instead, pd.cut uses np.searchsorted, which finds all the bin indices in a single call (and moreover, searchsorted is an O(log n) binary search written in C, rather than an O(n) loop written in Python).
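For intuition, here is a minimal sketch of that searchsorted approach (not pd.cut's actual source); the edges and labels are taken from getTag's sliceDef:

```python
import numpy as np

# One vectorized binary search bins every value at once,
# instead of a Python loop over dict keys per value.
edges = np.array([1, 4, 6, 9, 99])
labels = np.array(['tag1', 'tag2', 'tag3', 'tag4'])

values = np.array([2.23, 4.45, 7.87, 9.02, 4.0])
# side='left' makes a value sitting exactly on an edge (here 4.0)
# fall into the lower bin, matching pd.cut's default right-closed
# intervals (1, 4], (4, 6], ...
idx = np.searchsorted(edges, values, side='left') - 1
print(labels[idx])  # ['tag1' 'tag2' 'tag3' 'tag4' 'tag1']
```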


A subtle point: the keys returned by sliceDef.keys() are not guaranteed to come back in any particular order (before Python 3.7, dict order was not guaranteed, and string hash randomization could even reorder keys from run to run). Your criterion uses fully closed intervals:

    if ((float(rangeAttribute) >= float(bounds[0]))
        and (float(rangeAttribute) <= float(bounds[1]))):

so whichever key happens to be tested first wins whenever rangeAttribute falls exactly on a shared endpoint (e.g. 4 satisfies both tag1's [1, 4] and tag2's [4, 6]).

So your current code is non-deterministic.

pd.cut uses half-open intervals, so each value will fall into one and only one category, thus avoiding the problem.
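You can see this with a value sitting exactly on a bin edge; pd.cut's intervals are right-closed by default, so the boundary value lands in the lower bin and nowhere else:

```python
import pandas as pd

# 4.0 sits on the edge between the two bins; with right-closed
# intervals (1, 4] and (4, 6], it belongs to exactly one of them.
s = pd.Series([1.5, 4.0, 4.1])
binned = pd.cut(s, [1, 4, 6], labels=['low', 'high'])
print(list(binned))  # ['low', 'low', 'high']
```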


And to answer the general question: Yes, there is a way to pass extra arguments -- use apply instead of map (Thanks to Andy Hayden for pointing this out):

import numpy as np
import pandas as pd

def getTag(rangeAttribute, sliceDef):
    for sl in sliceDef.keys():
        bounds = sliceDef[sl]
        if ((float(rangeAttribute) >= float(bounds[0]))
            and (float(rangeAttribute) <= float(bounds[1]))):
            return sl

sliceDef = {'tag1': [1, 4], 'tag2': [4, 6], 'tag3': [6, 9],
            'tag4': [9, 99]}
sliceDef1 = {'100-150': [100000, 150000],
            '150-650': [150000, 650000],
            '650-5M': [650000, 5000000]}

mn = pd.DataFrame(
    {'fld1': [2.23, 4.45, 7.87, 9.02, 8.85, 3.32, 5.55],
     'fld2': [125000, 350000, 700000, 800000, 200000, 600000, 500000],
     'lType': ['typ1', 'typ2', 'typ3', 'typ1', 'typ3', 'typ1', 'typ2'],
     'counter': [100, 200, 300, 400, 500, 600, 700]})

result = mn.groupby([mn['fld1'].apply(getTag, args=(sliceDef,)),
                     mn['fld2'].apply(getTag, args=(sliceDef1,)),
                     'lType']).sum()
print(result)

Still, I don't recommend using apply for this particular problem, since pd.cut is faster, easier to use, and avoids the non-deterministic dict-key-order problem. But knowing that apply can take additional positional arguments may help you in the future.
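For completeness: map itself accepts any callable, so you can also freeze the extra argument with functools.partial (or a lambda) and keep using map. A small sketch, reusing the two-argument getTag from above:

```python
from functools import partial
import pandas as pd

def getTag(rangeAttribute, sliceDef):
    # Return the first tag whose closed interval contains the value.
    for sl, bounds in sliceDef.items():
        if bounds[0] <= float(rangeAttribute) <= bounds[1]:
            return sl

sliceDef = {'tag1': [1, 4], 'tag2': [4, 6], 'tag3': [6, 9],
            'tag4': [9, 99]}

s = pd.Series([2.23, 4.45, 7.87])
# partial binds sliceDef, leaving a one-argument callable for map.
tags = s.map(partial(getTag, sliceDef=sliceDef))
print(list(tags))  # ['tag1', 'tag2', 'tag3']
```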


Upvotes: 5
