FuriousTurtle

Reputation: 11

Pandas .loc super slow in large index dataset

I am new to pandas, so I assume I must be missing something obvious...

Summary:

I have a DataFrame with 300K+ rows. For each new row of data I receive, I check whether it relates to an existing subset of rows in the DF (identified by a Group ID), either retrieve the existing Group ID or generate a new one, and finally insert the row with that Group ID.

Pandas seems very slow for this.

Please advise: what am I missing, or should I be using something else?

Details:

Columns are (example):

columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')

Each groupID can have many unique timeStamp values.

groupID is generated internally:

  1. Either an existing one is reused (by matching the row to existing data, say by column 'D'), or
  2. A new groupID is generated.

Thus (in my view at least) I cannot do updates/inserts in bulk; I have to do it row by row.
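Roughly, the per-row groupID resolution looks something like this (a simplified sketch; the matching column 'D' and the helper name are just for illustration, the real logic is more involved):

# Simplified sketch of how a groupID is resolved for an incoming row (illustrative only)
def resolve_group_id(df, row, next_group_id):
    # Try to match the incoming row to existing data, e.g. on column 'D'
    matches = df.index[df['D'] == row['D']]
    if len(matches) > 0:
        # Reuse the groupID of the first matching row
        return df.loc[matches[0], 'groupID'], next_group_id
    # No match found: allocate a fresh groupID
    return next_group_id, next_group_id + 1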

Following an SQL DB analogy, I create an index as the concatenation of groupID and timeStamp (I have tried a MultiIndex, but it seems even slower).

Finally I insert/update using .loc[ind, columnName]

Code:

import pandas as pd
import numpy as np
import time 

columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')

columnTypeDict = {'groupID':'int64','timeStamp':'int64'}

startID = 1234567

df = pd.DataFrame(columns=columnList)
df = df.astype(columnTypeDict)

fID = list(range(startID,startID+300000))

df['groupID'] = fID

ts = [1000000000]*150000 + [10000000001]*150000

df['timeStamp'] = ts

# Build the index as the string concatenation of groupID and timeStamp
indx = [str(i) + str(j) for i, j in zip(fID, ts)]

df['Index'] = indx
df['Index'] = df['Index'].astype('uint64')
df = df.set_index('Index')

startTime = time.time()

for groupID in range(startID+49000,startID+50000) :

    timeStamp = 1000000003

    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))

    # print(ind)
    # Single-cell assignment; enlarges the DataFrame when ind is not already in the index
    df.loc[ind, 'A'] = 1


print(df)

print(time.time()-startTime,"secs")

If the index label already exists, it's fast, but if it doesn't, 10,000 inserts take 140 seconds.

Upvotes: 1

Views: 920

Answers (1)

Mahamadou

Reputation: 797

I think accessing a DataFrame row by row is a relatively expensive operation. You can save these values temporarily and use them to build a DataFrame that is then merged with the original one, as follows:

startTime = time.time()

temporary_idx = []
temporary_values = []

for groupID in range(startID+49000,startID+50000) :

    timeStamp = 1000000003

    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))
    temporary_idx.append(ind)
    temporary_values.append(1)

# create a dataframe with new values and apply a join with the original dataframe
df = df.drop(columns=["A"])\
    .merge(
        pd.DataFrame({"A": temporary_values}, index=temporary_idx).rename_axis("Index", axis="index"),
        how="outer", right_index=True, left_index=True
    )
print(df)
print(time.time()-startTime,"secs")

When I benchmarked it, this takes less than 2 seconds to execute.

I don't know exactly what your real use case is, but this handles the case of inserting column A as you stated in your example. If your use case is more complex than that, there might be a better solution.
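For instance, if you actually need to insert whole new rows rather than just filling column A, a similar batching idea with pd.concat should work. A sketch, assuming all the generated index values are new (otherwise you would also need an update/combine step), with names taken from your example:

# Sketch: collect whole rows in a list, then append them in a single pd.concat call
new_rows = []
for groupID in range(startID+49000, startID+50000):
    timeStamp = 1000000003
    ind = int(str(groupID) + str(timeStamp))
    # Build the full row as a dict; columns not listed here will be NaN
    new_rows.append({'Index': ind, 'groupID': groupID, 'timeStamp': timeStamp, 'A': 1})

new_df = pd.DataFrame(new_rows).set_index('Index')
# One concatenation instead of thousands of single-cell .loc enlargements
df = pd.concat([df, new_df])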

Upvotes: 1
