jamie_y

Reputation: 1919

Python Pandas: .apply taking forever?

I have a DataFrame 'clicks' created by parsing a 1.4 GB CSV. I'm trying to create a new column 'bought' using the apply function.

clicks['bought'] = clicks['session'].apply(getBoughtItemIDs)

In getBoughtItemIDs, I check whether the 'buys' DataFrame has the values I want and, if so, return a string concatenating them. The first line in getBoughtItemIDs is taking forever. What are the ways to make it faster?

def getBoughtItemIDs(val):
    # For each click's session value, scan all of 'buys' for matching rows
    boughtSessions = buys[buys['session'] == val].values
    output = ''
    for row in boughtSessions:
        output += str(row[1]) + ","
    return output

Upvotes: 3

Views: 5891

Answers (1)

Alex Riley

Reputation: 176810

There are a couple of things that make this code run slowly.

  • apply is essentially just syntactic sugar for a for loop over the rows of a column. There's also an explicit for loop over a NumPy array in your function (the for row in boughtSessions part). Looping in this (non-vectorised) way is best avoided whenever possible as it impacts performance heavily.

  • buys[buys['session'] == val].values looks up val across the entire 'session' column for each row of clicks, returning a sub-DataFrame and then creating a new NumPy array. Repeatedly looking up values in this way is expensive (each lookup is O(n) in the length of buys), and creating new arrays is expensive too, since memory has to be allocated and the data copied across each time. A rough timing sketch follows this list.
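
To get a feel for the cost, here's a small sketch with made-up data (the sizes, column names and values are assumptions, not taken from the question); it times a single boolean-mask lookup of the kind the function performs once per row of clicks:

import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's 'buys' DataFrame, purely to
# illustrate the cost of one lookup.
buys = pd.DataFrame({
    'session': np.random.randint(0, 100_000, size=1_000_000),
    'item': np.random.randint(0, 5_000, size=1_000_000),
})

# Each call builds a million-element boolean mask, filters the frame,
# and copies the matching rows into a fresh NumPy array.
per_lookup = timeit.timeit(
    lambda: buys[buys['session'] == 42].values, number=100
) / 100
print(f"one lookup: {per_lookup * 1000:.2f} ms")

# apply repeats that work once per row of 'clicks'; with millions of rows
# from a 1.4 GB CSV, the total quickly adds up to hours.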

If I understand what you're trying to do, you could try the following approach to get your new column.

First use groupby to group the rows of buys by the values in 'session'. apply is used to join up the strings for each value:

boughtSessions = buys.groupby('session')[col_to_join].apply(lambda x: ','.join(x))

where col_to_join is the column from buys which contains the values you want to join together into a string.
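
If the values in col_to_join are numeric rather than strings (the str(row[1]) call in the question suggests they might be), convert them first, otherwise ','.join will raise a TypeError:

boughtSessions = buys.groupby('session')[col_to_join].apply(lambda x: ','.join(x.astype(str)))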

groupby means that only one pass through the DataFrame is needed, and the grouping itself is pretty well-optimised in Pandas. The use of apply to join the strings is unavoidable here, but only one pass through the grouped values is needed.

boughtSessions is now a Series of strings indexed by the unique values in the 'session' column. This is useful because lookups to Pandas indexes are O(1) in complexity.

To match each string in boughtSessions to the appropriate value in clicks['session'] you can use map. Unlike apply, map is fully vectorised and should be very fast:

clicks['bought'] = clicks['session'].map(boughtSessions)
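
Putting the two steps together, here's a small self-contained sketch with made-up data (the column name 'item' and all the values are assumptions, purely for illustration):

import pandas as pd

# Hypothetical miniature versions of the two DataFrames from the question.
buys = pd.DataFrame({
    'session': [1, 1, 2, 4],
    'item':    ['a', 'b', 'c', 'd'],
})
clicks = pd.DataFrame({
    'session': [1, 2, 3, 4],
})

# One pass over buys: join the items bought in each session into a string.
boughtSessions = buys.groupby('session')['item'].apply(lambda x: ','.join(x))

# Fast index-based lookup of each click's session in that Series.
clicks['bought'] = clicks['session'].map(boughtSessions)

print(clicks)
# Expected output (roughly):
#    session bought
# 0        1    a,b
# 1        2      c
# 2        3    NaN
# 3        4      d

Sessions with no matching rows in buys come out as NaN; chaining .fillna('') onto the map call would turn those into empty strings, matching the question's original output.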

Upvotes: 5
