Toren

Reputation: 6854

Efficiently querying a pandas dataset

Given a pandas DataFrame that contains 8 million rows and 20 columns.

A program queries the dataset to find the average of a certain column.

The average is based on selections from the other columns.

I need help getting fast responses for a bundle of 10k queries and reducing the query execution time.

Setup :

import pandas as pd
from random import randint


df = pd.DataFrame({'A': ['A1','A2','A3','A4','A5','A2','A2','A1','A4','A4'],
                   'B': ['BA1','BA2','BA3','BA4','BA5','BA2','BA2','BA1','BA4','BA4'],
                   'C': ['CA1','CA2','CA3','CA4','CA5','CA2','CA2','CA1','CA4','CA4'],
                   'D': ['D1','D2','D3','D4','D5','D2','D2','D1','D4','D4'],
                   'important_col': [randint(1, 9)*100 for x in range(10)]})

Dataset example :

     A   B    C    D   important_col
0    A1  BA1  CA1  D1  400
1    A2  BA2  CA2  D2  500
2    A3  BA3  CA3  D3  100
3    A4  BA4  CA4  D4  100
4    A5  BA5  CA5  D5  400
5    A2  BA2  CA2  D2  900
6    A2  BA2  CA2  D2  100
7    A1  BA1  CA1  D1  300
8    A4  BA4  CA4  D4  800
9    A4  BA4  CA4  D4  100

Queries example :

df[(df['A']== 'A1')]['important_col'].mean()
df[(df['A']== 'A2') & (df['B'] == 'BA2')]['important_col'].mean()
df[(df['A']== 'A4') & (df['C'] == 'CA4') & (df['D'] == 'D4')]['important_col'].mean()

Will indexing help solve the problem? Can .loc or .ix be used? Or is there another solution?

Upvotes: 2

Views: 828

Answers (2)

DocZerø

Reputation: 8567

Columns A through D could be converted to categories, since their values are repeated and drawn from a limited set of distinct values.

The sample below is based on the df you provided in your OP.

# Original data frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
A                10 non-null object
B                10 non-null object
C                10 non-null object
D                10 non-null object
important_col    10 non-null int64
dtypes: int64(1), object(4)
memory usage: 480.0+ bytes

# Convert to category
df['A'] = df.A.astype('category')
df['B'] = df.B.astype('category')
df['C'] = df.C.astype('category')
df['D'] = df.D.astype('category')

# Modified data frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
A                10 non-null category
B                10 non-null category
C                10 non-null category
D                10 non-null category
important_col    10 non-null int64
dtypes: category(4), int64(1)
memory usage: 360.0 bytes

You should see a benefit both in memory usage (the string values are replaced by integers and mapped through a small lookup table) and in selection speed (lookups on integer codes are faster than the same lookups on string values).
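A minimal sketch of that conversion, assuming the df from the setup in the question (the select_dtypes loop is my own shortcut, not part of the original answer):

# Convert every string (object) column to category in one pass
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')

# Check the memory footprint per column
print(df.memory_usage(deep=True))

# The queries from the question work unchanged on categorical columns
print(df[df['A'] == 'A1']['important_col'].mean())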

Update

I created a Jupyter notebook to show the improvement gained by simply converting the columns to categories.

Using a sample of 1,000,000 rows (same structure as the OP defines) and the example queries provided in the OP, memory usage drops from 232.7 MB to 11.4 MB (a 95% reduction).

Furthermore, the sample queries show a speed benefit as well:

  • Query 1: 83% improvement (57 ms → 9.36 ms)
  • Query 2: 91% improvement (80.9 ms → 6.97 ms)
  • Query 3: 92% improvement (119 ms → 9.37 ms)

I ran the same test with an 8 million row sample and saw the same relative improvement in speed and memory usage.
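For reference, a rough sketch of how such a benchmark could be set up; the row count, seed, and column construction below are my own assumptions, not taken from the notebook:

import numpy as np
import pandas as pd

n = 1_000_000                      # illustrative sample size
rng = np.random.default_rng(0)     # fixed seed for repeatability
keys = rng.integers(1, 6, size=n)  # values 1..5, mirroring A1..A5, BA1..BA5, ...

big = pd.DataFrame({
    'A': ['A%d' % k for k in keys],
    'B': ['BA%d' % k for k in keys],
    'C': ['CA%d' % k for k in keys],
    'D': ['D%d' % k for k in keys],
    'important_col': rng.integers(1, 10, size=n) * 100,
})

cat = big.copy()
for col in 'ABCD':
    cat[col] = cat[col].astype('category')

# In IPython/Jupyter, compare for example:
#   %timeit big[(big['A'] == 'A1')]['important_col'].mean()
#   %timeit cat[(cat['A'] == 'A1')]['important_col'].mean()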

Upvotes: 3

clocker

Reputation: 1366

The answer from @Kristof is a good place to start; I noted a speedup of just under 2x with that suggestion. Something else to keep in mind for large DataFrames is the order of operations in an expression (e.g., do you need to create a new DataFrame just to select a Series, or can you generate the Series directly?). Using NumPy types directly, when the richer pandas methods are not needed, is another option.

Expanding on your example:

In [58]: df_big = pd.DataFrame()
In [59]: for i in range(1000): df_big = df_big.append(df)
In [61]: len(df_big)
Out[61]: 10000

In [62]: dfr = df_big.to_records()

In [63]: dfr
Out[63]: 
rec.array([(0, 'A1', 'BA1', 'CA1', 'D1', 900), (1, 'A2', 'BA2', 'CA2', 'D2', 900),
 (2, 'A3', 'BA3', 'CA3', 'D3', 500), ...,
 (7, 'A1', 'BA1', 'CA1', 'D1', 700), (8, 'A4', 'BA4', 'CA4', 'D4', 300),
 (9, 'A4', 'BA4', 'CA4', 'D4', 500)], 
          dtype=[('index', '<i8'), ('A', '|O'), ('B', '|O'), ('C', '|O'), ('D', '|O'), ('important_col', '<i8')])


In [71]: %timeit df_big[(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')]['important_col'].mean() 
100 loops, best of 3: 2.91 ms per loop

In [72]: %timeit df_big['important_col'][(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')].mean()
100 loops, best of 3: 2.46 ms per loop

In [73]: df_big[(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')]['important_col'].mean()

In [74]: %timeit dfr['important_col'][(dfr['A']== 'A4') & (dfr['C'] == 'CA4') & (dfr['D'] == 'D4')].mean()
1000 loops, best of 3: 877 µs per loop
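
Along the same lines, the columns can also be pulled out once as plain NumPy arrays and reused across queries. A minimal sketch, assuming the df_big from the session above (this variant is my own illustration, not part of the timed session):

# Extract the columns once as NumPy arrays
a = df_big['A'].values
c = df_big['C'].values
d = df_big['D'].values
vals = df_big['important_col'].values

# The same query as In [71], expressed purely on NumPy arrays
mask = (a == 'A4') & (c == 'CA4') & (d == 'D4')
print(vals[mask].mean())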

Upvotes: 2
