Reputation: 1045
Some stylized data to start with:
testdf = pd.DataFrame(data=[(1, 'AA', 'ServiceA'), (2, 'BB', 'ServiceB'), (3, 'CC', 'ServiceA'), (4, 'DD', 'ServiceD')],
                      columns=['Rev', 'Pnum', 'Service'])
Rev Pnum Service
0 1 AA ServiceA
1 2 BB ServiceB
2 3 CC ServiceA
3 4 DD ServiceD
To apportion the value of the services, we have:
pnumlist = ['AA', 'CC']
servicelist = ['ServiceA', 'ServiceB', 'ServiceC', 'ServiceD']
I'm trying to write a Pythonic function that would take the above df and return another df based on:
testdf['Charge'] = testdf['Rev'] if testdf['Pnum'] in pnumlist else 0  # doesn't work; raises "The truth value of a Series is ambiguous"
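The error comes from asking for the truth value of a whole Series (`testdf['Pnum'] in pnumlist` tests the Series, not each element). A vectorized sketch of just this line, using `Series.isin` with `numpy.where` on the sample data above:

```python
import numpy as np
import pandas as pd

pnumlist = ['AA', 'CC']
testdf = pd.DataFrame(data=[(1, 'AA', 'ServiceA'), (2, 'BB', 'ServiceB'),
                            (3, 'CC', 'ServiceA'), (4, 'DD', 'ServiceD')],
                      columns=['Rev', 'Pnum', 'Service'])

# isin builds an element-wise boolean mask; np.where picks Rev or 0 per row
testdf['Charge'] = np.where(testdf['Pnum'].isin(pnumlist), testdf['Rev'], 0)
print(testdf['Charge'].tolist())  # → [1, 0, 3, 0]
```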
The returned df should also have columns for counts of the various services in each line of the testdf, so it should look like:
outputdf = pd.DataFrame(data = [(1, 1, 0, 0, 0), (0, 0, 1, 0, 0), (3, 1, 0, 0, 0), (0, 0, 0, 0, 1)],
columns = ['Charge', 'Acount', 'Bcount', 'Ccount', 'Dcount'])
At the moment I have a rowhandler func that handles each row of testdf; I call apply on the df, passing the rowhandler func:
def rowhandler(testdfrow: pd.Series) -> pd.Series:
    testdfrow['Charge'] = testdfrow['Rev'] if testdfrow['Pnum'] in pnumlist else 0
    for service in servicelist:
        testdfrow['{}count'.format(service)] = 1 if service in testdfrow['Service'] else 0
    return testdfrow

newcolslist = ['Charge']
newcolsdict = {col: 0 for col in newcolslist}
testdf = testdf.assign(**newcolsdict)  # pre-allocating the columns speeds up the program
testdf = testdf.apply(rowhandler, axis=1)
The real rowhandler func handles several more columns and the data size is large, so I'm looking for ways to speed it up; I think vectorizing the rowhandler func would do it. Any suggestions appreciated, thanks.
Upvotes: 1
Views: 56
Reputation: 2135
You could just edit your dataframe in place using column-based operations. For example:
testdf["Charge"] = testdf["Rev"].where(testdf["Pnum"].isin(pnumlist), 0)
for service in servicelist:
    testdf["{}count".format(service)] = testdf["Service"].str.contains(service).astype(int)
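On the sample frame these two column-wise operations produce the expected values; a quick self-contained check:

```python
import pandas as pd

testdf = pd.DataFrame(data=[(1, 'AA', 'ServiceA'), (2, 'BB', 'ServiceB'),
                            (3, 'CC', 'ServiceA'), (4, 'DD', 'ServiceD')],
                      columns=['Rev', 'Pnum', 'Service'])
pnumlist = ['AA', 'CC']
servicelist = ['ServiceA', 'ServiceB', 'ServiceC', 'ServiceD']

# Series.where keeps Rev where the mask is True and substitutes 0 elsewhere
testdf['Charge'] = testdf['Rev'].where(testdf['Pnum'].isin(pnumlist), 0)
for service in servicelist:
    # str.contains gives a boolean Series; astype(int) turns it into 0/1 flags
    testdf['{}count'.format(service)] = testdf['Service'].str.contains(service).astype(int)

print(testdf['Charge'].tolist())         # → [1, 0, 3, 0]
print(testdf['ServiceAcount'].tolist())  # → [1, 0, 1, 0]
```

Note that this names the columns 'ServiceAcount' etc., following the question's '{}count' template.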
Here are some performance comparisons:
import timeit
testdf = pd.concat([testdf for i in range(10000)], ignore_index=True)
def fix_df():
    global testdf
    testdf["Charge"] = testdf["Rev"].where(testdf.Pnum.isin(pnumlist), 0)
    for service in servicelist:
        testdf["{}count".format(service)] = (
            testdf["Service"].str.contains(service).astype(int)
        )
    return testdf
def fix_df_orig():
    global testdf

    def rowhandler(testdfrow: pd.Series) -> pd.Series:
        testdfrow["Charge"] = testdfrow["Rev"] if testdfrow["Pnum"] in pnumlist else 0
        for service in servicelist:
            testdfrow["{}count".format(service)] = (
                1 if service in testdfrow["Service"] else 0
            )
        return testdfrow

    newcolslist = ["Charge"]
    newcolsdict = {col: 0 for col in newcolslist}
    testdf = testdf.assign(**newcolsdict)  # pre-allocating the columns speeds up the program
    testdf = testdf.apply(rowhandler, axis=1)
In [1]: timeit.timeit(fix_df, number=1)
Out[1]: 0.06966943305451423
In [2]: timeit.timeit(fix_df_orig, number=1)
Out[2]: 109.82892861706205
Seems like quite the improvement.
Edit: I updated my answer with a slightly more performant version.
Upvotes: 0
Reputation: 323326
Is this what you need? get_dummies with concat:
s1 = testdf[['Rev']].where(testdf.Pnum.isin(pnumlist), 0)
s2 = testdf['Service'].where(testdf['Service'].isin(servicelist)).str.get_dummies()
df = pd.concat([s1, s2.reindex(columns=servicelist, fill_value=0)], axis=1)
df
Out[563]:
Rev ServiceA ServiceB ServiceC ServiceD
0 1 1 0 0 0
1 0 0 1 0 0
2 3 1 0 0 0
3 0 0 0 0 1
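If the exact column names from the question's outputdf ('Charge', 'Acount', …) are wanted, a rename can be applied on top; a self-contained sketch (the target names are taken from the question, not produced by get_dummies itself):

```python
import pandas as pd

testdf = pd.DataFrame(data=[(1, 'AA', 'ServiceA'), (2, 'BB', 'ServiceB'),
                            (3, 'CC', 'ServiceA'), (4, 'DD', 'ServiceD')],
                      columns=['Rev', 'Pnum', 'Service'])
pnumlist = ['AA', 'CC']
servicelist = ['ServiceA', 'ServiceB', 'ServiceC', 'ServiceD']

s1 = testdf[['Rev']].where(testdf.Pnum.isin(pnumlist), 0)
s2 = testdf['Service'].where(testdf['Service'].isin(servicelist)).str.get_dummies()
df = pd.concat([s1, s2.reindex(columns=servicelist, fill_value=0)], axis=1)

# map the answer's column names onto the ones asked for in the question
out = df.rename(columns={'Rev': 'Charge', 'ServiceA': 'Acount',
                         'ServiceB': 'Bcount', 'ServiceC': 'Ccount',
                         'ServiceD': 'Dcount'})
print(list(out.columns))  # → ['Charge', 'Acount', 'Bcount', 'Ccount', 'Dcount']
```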
Upvotes: 1