Reputation: 1919
I have a DataFrame 'clicks' created by parsing a 1.4 GB CSV file. I'm trying to create a new column 'bought' using the apply function.
clicks['bought'] = clicks['session'].apply(getBoughtItemIDs)
In getBoughtItemIDs, I'm checking whether the 'buys' DataFrame has the values I want and, if so, returning a string that concatenates them. The first line in getBoughtItemIDs is taking forever. How can I make it faster?
def getBoughtItemIDs(val):
    boughtSessions = buys[buys['session'] == val].values
    output = ''
    for row in boughtSessions:
        output += str(row[1]) + ","
    return output
Upvotes: 3
Views: 5891
Reputation: 176810
There are a couple of things that make this code run slowly.
apply is essentially just syntactic sugar for a for loop over the rows of a column. There's also an explicit for loop over a NumPy array in your function (the for row in boughtSessions part). Looping in this (non-vectorised) way is best avoided whenever possible, as it impacts performance heavily.
buys[buys['session'] == val].values looks up val across an entire column for each row of clicks, then returns a sub-DataFrame and creates a new NumPy array from it. Repeatedly looking for values in this way is expensive (each lookup is O(n)). Creating the new arrays is also expensive, since memory has to be allocated and the data copied across each time.
If I understand what you're trying to do, you could try the following approach to get your new column.
First use groupby to group the rows of buys by the values in 'session'. apply is used to join up the strings for each value:
boughtSessions = buys.groupby('session')[col_to_join].apply(lambda x: ','.join(x))
where col_to_join is the column from buys which contains the values you want to join together into a string.
groupby means that only one pass through the DataFrame is needed and is pretty well-optimised in Pandas. The use of apply to join the strings is unavoidable here, but only one pass through the grouped values is needed.
boughtSessions is now a Series of strings indexed by the unique values in the 'session' column. This is useful because lookups to Pandas indexes are O(1) in complexity.
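As a rough illustration of the grouping step, here is a minimal sketch on toy data. The column name 'item' and the sample values are assumptions made up for the example, not taken from your data. It also casts the values to str before joining, on the assumption that the item IDs may be numeric (your original function calls str on them).

```python
import pandas as pd

# Toy stand-in for the real 'buys' frame; the 'item' column name is hypothetical.
buys = pd.DataFrame({
    "session": [1, 1, 2, 4],
    "item": [101, 102, 201, 401],
})

# One pass over buys: group by session, then join each group's items.
# astype(str) is needed because ','.join only accepts strings.
bought_sessions = (
    buys.groupby("session")["item"]
        .apply(lambda x: ",".join(x.astype(str)))
)

# The result is a Series indexed by session, so a single lookup is cheap.
print(bought_sessions.loc[1])
```

Note that the joined strings have no trailing comma, unlike the output of the original getBoughtItemIDs.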
To match each string in boughtSessions to the appropriate value in clicks['session'] you can use map. Unlike apply, map is fully vectorised and should be very fast:
clicks['bought'] = clicks['session'].map(boughtSessions)
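Putting the two steps together, a small end-to-end sketch might look like the following. The toy frames and the 'item' column name are assumptions for illustration only. One difference from the original function is worth knowing: sessions that never appear in buys come back from map as NaN rather than an empty string, so the sketch fills those in with fillna.

```python
import pandas as pd

# Toy data; column 'item' and all values are hypothetical.
buys = pd.DataFrame({"session": [1, 1, 2], "item": [101, 102, 201]})
clicks = pd.DataFrame({"session": [1, 2, 3, 1]})

# Step 1: one joined string per session (cast to str in case IDs are numeric).
bought_sessions = buys.groupby("session")["item"].apply(
    lambda x: ",".join(x.astype(str))
)

# Step 2: look up each click's session in the indexed Series.
# Session 3 has no purchases, so map yields NaN there; replace it with ''.
clicks["bought"] = clicks["session"].map(bought_sessions).fillna("")

print(clicks)
```

If you would rather keep NaN to mark sessions without purchases, simply drop the fillna call.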
Upvotes: 5