jason

Reputation: 3512

Pandas: Speed up filtering by creating index?

I wanted to use pandas like SQL for a web app (instead of holding the data in PostgreSQL, just hold it in a pandas DataFrame, since the data is just under 1 GB and does not change constantly). If I am doing a lookup based on multiple filters on columns (e.g. age > x, age < y, income > p, income < q), are there any ways to speed up this filtering, or is it already done below? In SQL one would declare an index on age and income to speed up such a query; I am wondering what the pandas way of doing this is, if any.

Upvotes: 5

Views: 5669

Answers (3)

Mohrez

Reputation: 11

When manipulating a 40+ million-row dataframe (dfBig), boolean indexing in a loop started to get really slow.

Here is some pseudo-code that worked for me; basically, split dfBig into a dictionary using dfBig.groupby() and use the dictionary entries instead of explicit boolean indexing:

# very slow version:
colValues = dfBig.SomeField1.unique()
indValues = dfBig.SomeField2.unique()
dfResult = ...  # set up output dataframe
for col in colValues:
    for idx in indValues:
        boolIndex = (dfBig.SomeField1 == col) & (dfBig.SomeField2 == idx)
        fetchValue = dfBig[boolIndex]   # this takes a looong time
        dfResult.loc[idx, col] = someFunction(fetchValue)


# much faster version:
dicCol = {x: y for x, y in dfBig.groupby('SomeField1')}
for col, dfCol in dicCol.items():
    dicIdx = {x: y for x, y in dfCol.groupby('SomeField2')}
    for idx, dfIdx in dicIdx.items():
        dfResult.loc[idx, col] = someFunction(dfIdx)

"faster version" made it faster by about two orders of magnitude for the 40mm row dataframe

Upvotes: 1

user1940040

Reputation:

Pandas is really just a wrapper around numpy.ndarray.

All the searches are really done on the underlying ndarray.

df[(df.age > x) & (df.age < y) & (df.income > p) & (df.income < q)]

should do the trick. But you could speed up the process by using numpy.ndarray directly or by using masked arrays: http://docs.scipy.org/doc/numpy/reference/maskedarray.html

These won't allocate new memory for the newly generated arrays. That means you avoid the time/CPU overhead of finding and allocating new memory for your "filtered" results, and you avoid the memory overhead induced by the copy itself. (Actually, this is not completely true, since the mask has to be stored somewhere, but you still don't have to copy your table values to another place in memory.)

However, this comes at a cost: masked arrays are a bit slower to work with, since the process has to check whether each value in memory is masked. If this is only used for "filtering", though, this access overhead should be unnoticeable (it becomes quite important when one wants to do calculations on the masked array).
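As a rough sketch of that idea (the thresholds x, y, p, q and the age/income columns are the ones from the question; whether this actually beats the pandas one-liner depends on your data):

import numpy as np

# pull the raw ndarrays out of the DataFrame
age = df['age'].values
income = df['income'].values

# boolean mask over the rows; this is the only new allocation
keep = (age > x) & (age < y) & (income > p) & (income < q)

# masked view of the ages: out-of-range values are masked, not copied
masked_age = np.ma.masked_array(age, mask=~keep)

# or materialize the filtered rows as a regular (copied) DataFrame
filtered = df[keep]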

EDIT:

For persistent and optimized data access on disk and in memory, there is PyTables, which is optimized for exactly that. That said, PyTables, like NumPy/pandas, was not really designed to be used that way.
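A minimal sketch of the PyTables route through pandas' HDFStore (the file name 'data.h5', the key 'people' and the numeric thresholds are made up; age and income must be declared as data_columns so the on-disk query can use them, and PyTables must be installed):

import pandas as pd

# write once: table format plus data_columns makes age/income queryable on disk
with pd.HDFStore('data.h5') as store:
    store.put('people', df, format='table', data_columns=['age', 'income'])

# query later: only the matching rows are read back into memory
with pd.HDFStore('data.h5') as store:
    subset = store.select('people',
                          where='(age > 30) & (age < 50) & (income > 20000) & (income < 90000)')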

Upvotes: 3

Andy Hayden

Reputation: 375485

The "pandas way" of doing this query is:

df[(x < df.age) & (df.age < y) & (p < df.income) & (df.income < q)]

pandas indexes everything by default (including all columns), so you don't need to explicitly declare beforehand what you are going to be querying.

(I can't say whether this setup would make sense for your dataset.)
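For example, with made-up data and thresholds (so x, y, p, q are just illustrative), the same filter can also be written with df.query, which reads a bit closer to SQL:

import numpy as np
import pandas as pd

# made-up example data; the real frame would be the ~1 GB one from the question
df = pd.DataFrame({'age': np.random.randint(18, 80, size=1000),
                   'income': np.random.randint(10000, 120000, size=1000)})
x, y, p, q = 25, 40, 30000, 90000

by_mask = df[(x < df.age) & (df.age < y) & (p < df.income) & (df.income < q)]

# equivalent query-string form; @ refers to local Python variables
by_query = df.query('(age > @x) & (age < @y) & (income > @p) & (income < @q)')

assert by_mask.equals(by_query)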

Upvotes: 7
