Uriel Vinetz

Reputation: 21

Pandas apply function running slow

I have this function that, given an id, a number n, and a dataframe, returns the nth element of column "something" among the rows whose "id" matches the id passed in:

def find_something(id, n, df):
    table = df.loc[df['id'] == id]
    try:
        something = table['something'].iloc[n-1]
    except IndexError:
        something = float('NaN')
    return something

When I run this for a single id (the id is an np.int32 and the df passed in has 20 million rows) it runs in 11.4 ns, but when I apply it to a dataframe column with 60K rows it takes hours:

my_table['new_column'] = my_table['id'].apply(find_something, args=(1,df,))

So if I have:

df = pd.DataFrame({'id' : [1, 2, 2, 2, 2, 1, 2, 2],
                   'something' : np.random.randn(8)})

And:

my_table = pd.DataFrame({'id' : [1, 2]})

my_table['new_column'] = my_table['id'].apply(find_something, args=(1,df,))

my_table should look like:

    id  new_column
0   1   -0.396238
1   2    0.074007

Is there a more efficient way to do this? I don't see why it takes 11 ns for a single element but hours for 60K.

Upvotes: 0

Views: 668

Answers (1)

I generated a similar dataset with 20 million rows and 60K IDs and ran it through your code; it took about an hour to finish. In general, user-defined functions passed to apply() are slow because apply() does not take advantage of Pandas' vectorization; it calls your function once per row in plain Python.

If executing apply() over large datasets is a major pain point for you, you could consider alternative solutions such as Bodo. I ran the same code through Bodo; it took only about 1.5 minutes to finish. Essentially, Bodo optimizes your apply() code and parallelizes it across cores. The community edition of Bodo lets you run on up to 4 cores. Here is a link to the installation page: https://docs.bodo.ai/latest/source/install.html
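
For reference, the lookup itself can also be written with vectorized pandas operations instead of a row-wise apply(). The sketch below is not part of the benchmark; it assumes n is 1-based (as in your function) and reuses your df and my_table: cumcount() marks each row's position within its id group, the nth row of every id is kept, and a left merge fills NaN for ids with fewer than n rows.

import pandas as pd

n = 1
# position of each row within its id group: 0 for the first row of an id, 1 for the second, ...
pos = df.groupby('id').cumcount()
# keep only the nth row of every id, renamed to the output column
nth_rows = df.loc[pos == n - 1, ['id', 'something']].rename(columns={'something': 'new_column'})
# left merge: ids with fewer than n rows (or absent from df) end up with NaN
my_table = my_table.merge(nth_rows, on='id', how='left')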

Data generation:

import pandas as pd
import numpy as np
import time

df = pd.DataFrame({'id' : np.random.randint(1,60000,20000000),
                   'something' : np.random.randn(20000000)})
my_table = pd.DataFrame({'id' : np.arange(1, 60000)})

my_table.to_parquet("table.pq")
df.to_parquet("df.pq")

With Pandas (I made some minor changes to your code to make it more robust):

def find_something(id,n,df):
    df = df.loc[(df['id'] == id)]
    if len(df) != 0:
        result = df['something'].iloc[n-1]
    else:
        result = np.nan
    return result

start = time.time()

df = pd.read_parquet("df.pq")
my_table = pd.read_parquet("table.pq")
my_table['new_column'] = my_table['id'].apply(find_something, args=(1,df,))

end = time.time()
print("computation time: ", end - start)

print(my_table.head())

output:
computation time:  3482.743801832199
   id  new_column
0   1   -1.096224
1   2    0.667792
2   3    1.069627
3   4    0.129955
4   5    0.150882

With Bodo:

%%px

import pandas as pd
import numpy as np
import time
import bodo

@bodo.jit(distributed = ['df', 'result'])
def find_something(id,n,df):
    df = df.loc[(df['id'] == id)]
    if len(df) != 0:
        result = df['something'].iloc[n-1]
    else:
        result = np.nan
    return result

@bodo.jit(distributed = ['my_table', 'df'])
def new_column():
    start = time.time()
    df = pd.read_parquet("df.pq")
    my_table = pd.read_parquet("table.pq")
    my_table['new_column'] = my_table['id'].apply(find_something, args=(1,df,))
    end = time.time()
    print("computation time: ", end - start)
    print(my_table.head())
    return my_table
    
my_table = new_column()

output:
[stdout:0] 
computation time:  103.9169020652771
   id  new_column
0   1   -1.096224
1   2    0.667792
2   3    1.069627
3   4    0.129955
4   5    0.150882

Disclaimer: I work as a data scientist advocate at Bodo.ai.

Upvotes: 1
