FuriousTurtle

Reputation: 11

Pandas .loc super slow in large index dataset

I am new to pandas, so I assume I must be missing something obvious...

Summary:

I have a DataFrame with 300K+ rows. For each new row of data I receive, I check whether it relates to an existing subset of rows in the DF (identified by a Group ID), either retrieve the existing Group ID or generate a new one, and finally insert the row with that Group ID.

Pandas seems very slow for this.

Please advise: what am I missing, or should I be using something else?

Details:

Columns are (example):

columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')

Each groupID can have many unique timeStamp values.

groupID is generated internally:

  1. Either an existing one is reused (by matching the row to existing data, say by column 'D'), or
  2. A new groupID is generated.

Thus (in my view at least) I cannot do updates/inserts in bulk; I have to do it row by row.
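Roughly, the per-row groupID resolution looks something like this (a simplified sketch; the matching column 'D' and the helper name are just for illustration, the real logic is more involved):

# Simplified sketch of how a groupID is resolved for an incoming row (illustrative only)
def resolve_group_id(df, row, next_group_id):
    # Try to match the incoming row to existing data, e.g. on column 'D'
    matches = df.index[df['D'] == row['D']]
    if len(matches) > 0:
        # Reuse the groupID of the first matching row
        return df.loc[matches[0], 'groupID'], next_group_id
    # No match found: allocate a fresh groupID
    return next_group_id, next_group_id + 1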

Following an SQL DB analogy, I create an index as the concatenation of groupID and timeStamp (I have tried a MultiIndex, but it seems even slower).

Finally I insert/update using .loc[ind, columnName]

Code:

import pandas as pd
import numpy as np
import time 

columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')

columnTypeDict = {'groupID':'int64','timeStamp':'int64'}

startID = 1234567

df = pd.DataFrame(columns=columnList)
df = df.astype(columnTypeDict)

fID = list(range(startID,startID+300000))

df['groupID'] = fID

ts = [1000000000]*150000 + [10000000001]*150000

df['timeStamp'] = ts

# Build the index as the string concatenation of groupID and timeStamp
indx = [str(i) + str(j) for i, j in zip(fID, ts)]

df['Index'] = indx
df['Index'] = df['Index'].astype('uint64')
df = df.set_index('Index')

startTime = time.time()

for groupID in range(startID+49000,startID+50000) :

    timeStamp = 1000000003

    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))

    # print(ind)
    # Single-cell assignment; enlarges the DataFrame when ind is not already in the index
    df.loc[ind, 'A'] = 1


print(df)

print(time.time()-startTime,"secs")

If the index label already exists, it's fast, but if it doesn't, 10,000 inserts take 140 seconds.

Upvotes: 1

Views: 920

Answers (1)

Mahamadou

Reputation: 797

I think accessing a DataFrame row by row is a relatively expensive operation. You can save these values temporarily and use them to build a DataFrame that is then merged with the original one, as follows:

startTime = time.time()

temporary_idx = []
temporary_values = []

for groupID in range(startID+49000,startID+50000) :

    timeStamp = 1000000003

    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))
    temporary_idx.append(ind)
    temporary_values.append(1)

# create a dataframe with new values and apply a join with the original dataframe
df = df.drop(columns=["A"])\
    .merge(
        pd.DataFrame({"A": temporary_values}, index=temporary_idx).rename_axis("Index", axis="index"),
        how="outer", right_index=True, left_index=True
    )
print(df)
print(time.time()-startTime,"secs")

When I benchmarked it, this takes less than 2 seconds to execute.

I don't know exactly what your real use case is, but this handles the case of inserting column A as you stated in your example. If your use case is more complex than that, there might be a better solution.
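For instance, if you actually need to insert whole new rows rather than just filling column A, a similar batching idea with pd.concat should work. A sketch, assuming all the generated index values are new (otherwise you would also need an update/combine step), with names taken from your example:

# Sketch: collect whole rows in a list, then append them in a single pd.concat call
new_rows = []
for groupID in range(startID+49000, startID+50000):
    timeStamp = 1000000003
    ind = int(str(groupID) + str(timeStamp))
    # Build the full row as a dict; columns not listed here will be NaN
    new_rows.append({'Index': ind, 'groupID': groupID, 'timeStamp': timeStamp, 'A': 1})

new_df = pd.DataFrame(new_rows).set_index('Index')
# One concatenation instead of thousands of single-cell .loc enlargements
df = pd.concat([df, new_df])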

Upvotes: 1
