Reputation: 6854
I have a pandas DataFrame that contains 8 million rows and 20 columns.
A program queries the DataFrame to find the average of a certain column, where the rows included in the average are selected by values in other columns.
I need help getting fast responses to a bundle of 10k such queries, i.e. reducing the query execution time.
Setup:
import pandas as pd
from random import randint
df = pd.DataFrame({'A': ['A1','A2','A3','A4','A5','A2','A2','A1','A4','A4'],
                   'B': ['BA1','BA2','BA3','BA4','BA5','BA2','BA2','BA1','BA4','BA4'],
                   'C': ['CA1','CA2','CA3','CA4','CA5','CA2','CA2','CA1','CA4','CA4'],
                   'D': ['D1','D2','D3','D4','D5','D2','D2','D1','D4','D4'],
                   'important_col': [randint(1, 9) * 100 for x in range(10)]})
Dataset example:
A B C D important_col
0 A1 BA1 CA1 D1 400
1 A2 BA2 CA2 D2 500
2 A3 BA3 CA3 D3 100
3 A4 BA4 CA4 D4 100
4 A5 BA5 CA5 D5 400
5 A2 BA2 CA2 D2 900
6 A2 BA2 CA2 D2 100
7 A1 BA1 CA1 D1 300
8 A4 BA4 CA4 D4 800
9 A4 BA4 CA4 D4 100
Queries example:
df[(df['A']== 'A1')]['important_col'].mean()
df[(df['A']== 'A2') & (df['B'] == 'BA2')]['important_col'].mean()
df[(df['A']== 'A4') & (df['C'] == 'CA4') & (df['D'] == 'D4')]['important_col'].mean()
Will indexing help to solve the problem? Can .loc or .ix be used? Or is there another solution?
Upvotes: 2
Views: 828
Reputation: 8567
The columns A through D could be converted to categories, as the values are non-unique and limited in number.
The sample below is based on the df you provided in your OP.
# Original data frame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
A 10 non-null object
B 10 non-null object
C 10 non-null object
D 10 non-null object
important_col 10 non-null int64
dtypes: int64(1), object(4)
memory usage: 480.0+ bytes
# Convert to category
df['A'] = df.A.astype('category')
df['B'] = df.B.astype('category')
df['C'] = df.C.astype('category')
df['D'] = df.D.astype('category')
# Modified data frame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
A 10 non-null category
B 10 non-null category
C 10 non-null category
D 10 non-null category
important_col 10 non-null int64
dtypes: category(4), int64(1)
memory usage: 360.0 bytes
You should see a benefit both in memory usage (the string values are replaced by small integers and mapped through a compact lookup table) and in selection speed (lookups on integer values are faster than the same lookups on string values).
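To see what the conversion actually stores, you can inspect the codes and the lookup table directly. A minimal sketch, using the values of column A from the example frame:

import pandas as pd

s = pd.Series(['A1','A2','A3','A4','A5','A2','A2','A1','A4','A4'],
              dtype='category')
print(s.cat.categories)      # Index(['A1', 'A2', 'A3', 'A4', 'A5'], ...) -- the small lookup table
print(s.cat.codes.tolist())  # [0, 1, 2, 3, 4, 1, 1, 0, 3, 3] -- integers stored per row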
I created a Jupyter notebook to show the improvement of simply converting the columns to categories.
Using a sample of 1,000,000 rows (same structure as the OP defines) and the example queries provided in the OP, memory usage drops from 232.7 MB to 11.4 MB (a 95% reduction).
Furthermore, the sample queries show a speed benefit as well.
I ran the same test with an 8-million-row sample, with the same kind of speed and memory improvement.
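A minimal sketch of that kind of timing comparison (the randomly generated 1,000,000-row frame below is an assumption, not the notebook's exact code):

import numpy as np
import pandas as pd

n = 1_000_000
rng = np.random.default_rng(0)
df_obj = pd.DataFrame({'A': rng.choice(['A1','A2','A3','A4','A5'], n),
                       'B': rng.choice(['BA1','BA2','BA3','BA4','BA5'], n),
                       'important_col': rng.integers(1, 10, n) * 100})
df_cat = df_obj.copy()
df_cat['A'] = df_cat['A'].astype('category')
df_cat['B'] = df_cat['B'].astype('category')

# In IPython/Jupyter, compare:
# %timeit df_obj[(df_obj['A'] == 'A2') & (df_obj['B'] == 'BA2')]['important_col'].mean()
# %timeit df_cat[(df_cat['A'] == 'A2') & (df_cat['B'] == 'BA2')]['important_col'].mean()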
Upvotes: 3
Reputation: 1366
The answer from @Kristof is a good place to start; I noted a speedup of just under 2x with that suggestion. Some additional things to keep in mind for large DataFrames are the order of operations (e.g., do you need to create a new DataFrame and then select a Series from it, or can you produce the Series directly?) and using numpy types directly when the richer pandas methods are not needed.
Expanding on your example:
In [58]: df_big = pd.DataFrame()
In [59]: for i in range(1000): df_big = df_big.append(df)
In [61]: len(df_big)
Out[61]: 10000
In [62]: dfr = df_big.to_records()
In [63]: dfr
Out[63]:
rec.array([(0, 'A1', 'BA1', 'CA1', 'D1', 900), (1, 'A2', 'BA2', 'CA2', 'D2', 900),
(2, 'A3', 'BA3', 'CA3', 'D3', 500), ...,
(7, 'A1', 'BA1', 'CA1', 'D1', 700), (8, 'A4', 'BA4', 'CA4', 'D4', 300),
(9, 'A4', 'BA4', 'CA4', 'D4', 500)],
dtype=[('index', '<i8'), ('A', '|O'), ('B', '|O'), ('C', '|O'), ('D', '|O'), ('important_col', '<i8')])
In [71]: %timeit df_big[(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')]['important_col'].mean()
100 loops, best of 3: 2.91 ms per loop
In [72]: %timeit df_big['important_col'][(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')].mean()
100 loops, best of 3: 2.46 ms per loop
In [74]: %timeit dfr['important_col'][(dfr['A']== 'A4') & (dfr['C'] == 'CA4') & (dfr['D'] == 'D4')].mean()
1000 loops, best of 3: 877 µs per loop
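For a bundle of queries, note that to_records() copies the frame into a plain numpy record array, so it should be built once and reused across queries; each query is then ordinary numpy boolean masking. A sketch, assuming df is the frame from the question:

# One-time copy into a numpy record array (do this outside the query loop)
dfr = df.to_records()

# Reusable per-query pattern: numpy boolean masks, no pandas indexing overhead
mask = (dfr['A'] == 'A4') & (dfr['C'] == 'CA4') & (dfr['D'] == 'D4')
print(dfr['important_col'][mask].mean())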
Upvotes: 2