Reputation: 21
I have a function that, given an id, a number n, and a dataframe, returns the nth element of column "something" among the rows whose "id" matches the id in the params.
def find_something(id, n, df):
    table = df.loc[df['id'] == id]
    try:
        something = table['something'].iloc[n-1]
    except IndexError:
        something = float('NaN')
    return something
When I run this for a single id (the id is an np.int32 and the df in the params has 20 million rows) it runs in 11.4 ns, but when I apply it to a dataframe column with 60K rows it takes hours to run:
my_table['new_column'] = my_table['id'].apply(find_something, args=(1,df,))
So if I have:
df = pd.DataFrame({'id' : [1, 2, 2, 2, 2, 1, 2, 2],
                   'something' : np.random.randn(8)})
And:
my_table = pd.DataFrame({'id' : [1, 2]})
my_table['new_column'] = my_table['id'].apply(find_something, args=(1,df,))
my_table should look like:
id new_column
0 1 -0.396238
1 2 0.074007
Is there a more efficient way to do this? I don't see any reason why one element takes 11.4 ns but 60K elements take hours.
Upvotes: 0
Views: 668
Reputation: 309
I generated a similar dataset with 20 million rows and 60K IDs and ran it through your code; it took about an hour to finish. In general, user-defined functions are slow because apply() does not take advantage of Pandas' vectorization. If executing apply() on large datasets is a major pain point for you, you could consider alternative solutions such as Bodo. I ran the same code through Bodo; it only took about 1.5 minutes to finish. Essentially, Bodo optimizes your apply() code to keep the vectorization while providing access to efficient parallelization. The community edition of Bodo lets you run on up to 4 cores. Here is a link to the installation page: https://docs.bodo.ai/latest/source/install.html
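To see the vectorization gap on a small scale, here is a minimal sketch (not from the question) comparing a per-row apply() against an equivalent vectorized operation; exact timings vary by machine:

```python
import time

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(1_000_000))

t0 = time.time()
a = s.apply(lambda x: x * 2)  # one Python-level function call per element
t_apply = time.time() - t0

t0 = time.time()
b = s * 2                     # single vectorized operation in compiled code
t_vec = time.time() - t0

print(f"apply: {t_apply:.3f}s  vectorized: {t_vec:.3f}s")
```

On a typical machine the vectorized version is one to two orders of magnitude faster, and the gap only grows with row count.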
# data generation
import pandas as pd
import numpy as np
import time

df = pd.DataFrame({'id' : np.random.randint(1, 60000, 20000000),
                   'something' : np.random.randn(20000000)})
my_table = pd.DataFrame({'id' : np.arange(1, 60000)})
my_table.to_parquet("table.pq")
df.to_parquet("df.pq")
With Pandas (I did some minor changes in your code to make it more robust):
def find_something(id, n, df):
    df = df.loc[df['id'] == id]
    if len(df) != 0:
        result = df['something'].iloc[n-1]
    else:
        result = np.nan
    return result
start = time.time()
df = pd.read_parquet("df.pq")
my_table = pd.read_parquet("table.pq")
my_table['new_column'] = my_table['id'].apply(find_something, args=(1,df,))
end = time.time()
print("computation time: ", end - start)
print(my_table.head())
output:
computation time: 3482.743801832199
id new_column
0 1 -1.096224
1 2 0.667792
2 3 1.069627
3 4 0.129955
4 5 0.150882
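As a side note, this particular "nth row per id" lookup can also be done in plain Pandas without apply() at all, using groupby().cumcount() to number the rows within each id group. A minimal sketch on toy data (deterministic values instead of randn so the result is easy to check; n counts rows in their original order, as in the question):

```python
import pandas as pd

# Toy stand-in for the 20M-row frame.
df = pd.DataFrame({'id' :        [1, 2, 2, 2, 2, 1, 2, 2],
                   'something' : [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})
my_table = pd.DataFrame({'id' : [1, 2]})

n = 1
# cumcount() numbers rows within each id group in original order;
# keep only each group's (n-1)th row, then index by id for the lookup.
pos = df.groupby('id').cumcount()
nth = df.loc[pos == n - 1].set_index('id')['something']
# map() leaves NaN for any id with fewer than n rows.
my_table['new_column'] = my_table['id'].map(nth)
print(my_table)
```

This replaces 60K filter-the-whole-frame passes with a single pass over df, so it scales with the size of df rather than with len(my_table) * len(df).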
With Bodo:
%%px
import pandas as pd
import numpy as np
import time
import bodo

@bodo.jit(distributed = ['df', 'result'])
def find_something(id, n, df):
    df = df.loc[df['id'] == id]
    if len(df) != 0:
        result = df['something'].iloc[n-1]
    else:
        result = np.nan
    return result

@bodo.jit(distributed = ['my_table', 'df'])
def new_column():
    start = time.time()
    df = pd.read_parquet("df.pq")
    my_table = pd.read_parquet("table.pq")
    my_table['new_column'] = my_table['id'].apply(find_something, args=(1, df,))
    end = time.time()
    print("computation time: ", end - start)
    print(my_table.head())
    return my_table

my_table = new_column()
output:
[stdout:0]
computation time: 103.9169020652771
id new_column
0 1 -1.096224
1 2 0.667792
2 3 1.069627
3 4 0.129955
4 5 0.150882
Disclaimer: I work as a data scientist advocate in Bodo.ai.
Upvotes: 1