Toren

Reputation: 6854

Efficiently querying a pandas dataset

Given a pandas DataFrame that contains 8 million rows and 20 columns.

A program queries the dataset to find the average of a certain column.

The average is based on selections from the other columns.

I need help getting fast responses for a bundle of 10k queries and reducing the query execution time.

Setup :

import pandas as pd
from random import randint


df = pd.DataFrame({'A': ['A1','A2','A3','A4','A5','A2','A2','A1','A4','A4'],
                   'B': ['BA1','BA2','BA3','BA4','BA5','BA2','BA2','BA1','BA4','BA4'],
                   'C': ['CA1','CA2','CA3','CA4','CA5','CA2','CA2','CA1','CA4','CA4'],
                   'D': ['D1','D2','D3','D4','D5','D2','D2','D1','D4','D4'],
                   'important_col': [randint(1, 9)*100 for x in range(10)]})

Dataset example :

     A   B    C    D   important_col
0    A1  BA1  CA1  D1  400
1    A2  BA2  CA2  D2  500
2    A3  BA3  CA3  D3  100
3    A4  BA4  CA4  D4  100
4    A5  BA5  CA5  D5  400
5    A2  BA2  CA2  D2  900
6    A2  BA2  CA2  D2  100
7    A1  BA1  CA1  D1  300
8    A4  BA4  CA4  D4  800
9    A4  BA4  CA4  D4  100

Queries example :

df[(df['A']== 'A1')]['important_col'].mean()
df[(df['A']== 'A2') & (df['B'] == 'BA2')]['important_col'].mean()
df[(df['A']== 'A4') & (df['C'] == 'CA4') & (df['D'] == 'D4')]['important_col'].mean()

Will indexing help solve the problem? Can .loc or .ix be used? Or is there another solution?

Upvotes: 2

Views: 828

Answers (2)

DocZerø

Reputation: 8567

Columns A through D could be converted to categories, since their values are repeated and drawn from a limited set of distinct values.

The sample below is based on the df you provided in your OP.

# Original data frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
A                10 non-null object
B                10 non-null object
C                10 non-null object
D                10 non-null object
important_col    10 non-null int64
dtypes: int64(1), object(4)
memory usage: 480.0+ bytes

# Convert to category
df['A'] = df.A.astype('category')
df['B'] = df.B.astype('category')
df['C'] = df.C.astype('category')
df['D'] = df.D.astype('category')

# Modified data frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
A                10 non-null category
B                10 non-null category
C                10 non-null category
D                10 non-null category
important_col    10 non-null int64
dtypes: category(4), int64(1)
memory usage: 360.0 bytes

You should see a benefit both in memory usage (the string values are replaced by integers and mapped through a small lookup table) and in selection speed (lookups on integer codes are faster than the same lookups on string values).
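A minimal sketch of that conversion, assuming the df from the setup in the question (the select_dtypes loop is my own shortcut, not part of the original answer):

# Convert every string (object) column to category in one pass
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')

# Check the memory footprint per column
print(df.memory_usage(deep=True))

# The queries from the question work unchanged on categorical columns
print(df[df['A'] == 'A1']['important_col'].mean())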

Update

I created a Jupyter notebook to show the improvement gained by simply converting the columns to categories.

Using a sample of 1,000,000 rows (same structure as the OP defines) and the example queries provided in the OP, memory usage drops from 232.7 MB to 11.4 MB (a 95% reduction).

Furthermore, the sample queries show a speed benefit as well:

  • Query 1: 83% improvement (57 ms → 9.36 ms)
  • Query 2: 91% improvement (80.9 ms → 6.97 ms)
  • Query 3: 92% improvement (119 ms → 9.37 ms)

I ran the same test with an 8 million row sample and saw the same relative improvement in speed and memory usage.
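For reference, a rough sketch of how such a benchmark could be set up; the row count, seed, and column construction below are my own assumptions, not taken from the notebook:

import numpy as np
import pandas as pd

n = 1_000_000                      # illustrative sample size
rng = np.random.default_rng(0)     # fixed seed for repeatability
keys = rng.integers(1, 6, size=n)  # values 1..5, mirroring A1..A5, BA1..BA5, ...

big = pd.DataFrame({
    'A': ['A%d' % k for k in keys],
    'B': ['BA%d' % k for k in keys],
    'C': ['CA%d' % k for k in keys],
    'D': ['D%d' % k for k in keys],
    'important_col': rng.integers(1, 10, size=n) * 100,
})

cat = big.copy()
for col in 'ABCD':
    cat[col] = cat[col].astype('category')

# In IPython/Jupyter, compare for example:
#   %timeit big[(big['A'] == 'A1')]['important_col'].mean()
#   %timeit cat[(cat['A'] == 'A1')]['important_col'].mean()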

Upvotes: 3

clocker

Reputation: 1366

The answer from @Kristof is a good place to start; I noted a speedup of just under 2x with that suggestion. Something else to keep in mind for large DataFrames is the order of operations in an expression (e.g., do you need to create a new DataFrame just to select a Series, or can you generate the Series directly?). Using NumPy types directly, when the richer pandas methods are not needed, is another option.

Expanding on your example:

In [58]: df_big = pd.DataFrame()
In [59]: for i in range(1000): df_big = df_big.append(df)
In [61]: len(df_big)
Out[61]: 10000

In [62]: dfr = df_big.to_records()

In [63]: dfr
Out[63]: 
rec.array([(0, 'A1', 'BA1', 'CA1', 'D1', 900), (1, 'A2', 'BA2', 'CA2', 'D2', 900),
 (2, 'A3', 'BA3', 'CA3', 'D3', 500), ...,
 (7, 'A1', 'BA1', 'CA1', 'D1', 700), (8, 'A4', 'BA4', 'CA4', 'D4', 300),
 (9, 'A4', 'BA4', 'CA4', 'D4', 500)], 
          dtype=[('index', '<i8'), ('A', '|O'), ('B', '|O'), ('C', '|O'), ('D', '|O'), ('important_col', '<i8')])


In [71]: %timeit df_big[(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')]['important_col'].mean() 
100 loops, best of 3: 2.91 ms per loop

In [72]: %timeit df_big['important_col'][(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')].mean()
100 loops, best of 3: 2.46 ms per loop

In [73]: df_big[(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')]['important_col'].mean()

In [74]: %timeit dfr['important_col'][(dfr['A']== 'A4') & (dfr['C'] == 'CA4') & (dfr['D'] == 'D4')].mean()
1000 loops, best of 3: 877 µs per loop
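
Along the same lines, the columns can also be pulled out once as plain NumPy arrays and reused across queries. A minimal sketch, assuming the df_big from the session above (this variant is my own illustration, not part of the timed session):

# Extract the columns once as NumPy arrays
a = df_big['A'].values
c = df_big['C'].values
d = df_big['D'].values
vals = df_big['important_col'].values

# The same query as In [71], expressed purely on NumPy arrays
mask = (a == 'A4') & (c == 'CA4') & (d == 'D4')
print(vals[mask].mean())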

Upvotes: 2
