Reputation: 429
I have a following dataFrame
mn = pd.DataFrame({'fld1': [2.23, 4.45, 7.87, 9.02, 8.85, 3.32, 5.55],'fld2': [125000, 350000,700000, 800000, 200000, 600000, 500000],'lType': ['typ1','typ2','typ3','typ1','typ3','typ1','typ2'], 'counter': [100,200,300,400,500,600,700]})
Mapping function
def getTag(rangeAttribute):
sliceDef = {'tag1': [1, 4], 'tag2': [4, 6], 'tag3': [6, 9],
'tag4': [9, 99]}
for sl in sliceDef.keys():
bounds = sliceDef[sl]
if ((float(rangeAttribute) >= float(bounds[0]))
and (float(rangeAttribute) <= float(bounds[1]))):
return sl
def getTag1(rangeAttribute):
sliceDef = {'100-150': [100000, 150000],
'150-650': [150000, 650000],
'650-5M': [650000, 5000000]}
for sl in sliceDef.keys():
bounds = sliceDef[sl]
if ((float(rangeAttribute) >= float(bounds[0]))
and (float(rangeAttribute) <= float(bounds[1]))):
return sl
I want to compute sum based on the tags for fld1 and fld2.
At present I have to write different functions with hardcoded values for different types of field. MAP function only takes 1 argument. Is there any other function other than MAP
that can also take sliceDef as input param.
mn.groupby([mn['fld1'].map(getTag),mn['fld2'].map(getTag1),'lType'] ).sum()
Upvotes: 4
Views: 4753
Reputation: 880329
Instead of using map, you could use pd.cut (Thanks to DSM and Jeff for pointing this out):
import numpy as np
import pandas as pd
mn = pd.DataFrame(
{'fld1': [2.23, 4.45, 7.87, 9.02, 8.85, 3.32, 5.55],
'fld2': [125000, 350000, 700000, 800000, 200000, 600000, 500000],
'lType': ['typ1', 'typ2', 'typ3', 'typ1', 'typ3', 'typ1', 'typ2'],
'counter': [100, 200, 300, 400, 500, 600, 700]})
result = mn.groupby(
[pd.cut(mn['fld1'], [1,4,6,9,99], labels=['tag1', 'tag2', 'tag3', 'tag4']),
pd.cut(mn['fld2'], [100000, 150000, 650000, 5000000],
labels=['100-150', '150-650', '650-5M']),
'lType']).sum()
print(result)
yields
counter fld1 fld2
lType
tag1 100-150 typ1 100 2.23 125000
150-650 typ1 600 3.32 600000
tag2 150-650 typ2 900 10.00 850000
tag3 150-650 typ3 500 8.85 200000
650-5M typ3 300 7.87 700000
tag4 650-5M typ1 400 9.02 800000
This will be quicker than calling getTag
or getTag1
once for every value in series. Instead, pd.cut
uses np.searchsorted which returns all the indices with just one call (and moreover, searchsorted
uses O(log n) binary search written in C instead of an O(n) loop written in Python).
A subtle point: The keys returned by sliceDef.keys()
are not guaranteed to be in any particular order. It could change even from run to run (at least with Python3). Your criterion uses fully closed intervals:
if ((float(rangeAttribute) >= float(bounds[0]))
and (float(rangeAttribute) <= float(bounds[1]))):
so it might matter which key is tested first if rangeAttribute
happens to fall on one of the values in bounds
.
So your current code is non-deterministic.
pd.cut
uses half-open intervals, so each value will fall into one and only one category, thus avoiding the problem.
And to answer the general question: Yes, there is a way to pass extra arguments -- use apply instead of map (Thanks to Andy Hayden for pointing this out):
import numpy as np
import pandas as pd
def getTag(rangeAttribute, sliceDef):
for sl in sliceDef.keys():
bounds = sliceDef[sl]
if ((float(rangeAttribute) >= float(bounds[0]))
and (float(rangeAttribute) <= float(bounds[1]))):
return sl
sliceDef = {'tag1': [1, 4], 'tag2': [4, 6], 'tag3': [6, 9],
'tag4': [9, 99]}
sliceDef1 = {'100-150': [100000, 150000],
'150-650': [150000, 650000],
'650-5M': [650000, 5000000]}
mn = pd.DataFrame(
{'fld1': [2.23, 4.45, 7.87, 9.02, 8.85, 3.32, 5.55],
'fld2': [125000, 350000, 700000, 800000, 200000, 600000, 500000],
'lType': ['typ1', 'typ2', 'typ3', 'typ1', 'typ3', 'typ1', 'typ2'],
'counter': [100, 200, 300, 400, 500, 600, 700]})
result = mn.groupby([mn['fld1'].apply(getTag, args=(sliceDef, ))
,mn['fld2'].apply(getTag, args=(sliceDef1, )),
'lType'] ).sum()
print(result)
Still, I don't recommend using apply
for this particular problem since pd.cut
is be faster, easier to use, and avoids the non-deterministic order of dict keys problem. But knowing that apply
can take additional positional arguments may help you in the future.
Upvotes: 5