Pandas selecting columns - best habit and performance

Question

There are many different ways to select a column in a pandas.DataFrame (same for rows). I am wondering if it makes any difference and if there are any performance and style recommendations.

E.g., if I have a DataFrame as follows:

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.random((10,4)), columns=['a','b','c','d'])
df.head()


There are several ways to select, e.g., column d:

1) df['d']
2) df.loc[:,'d'] (where df.loc[row_indexer,column_indexer])
3) df.loc[:]['d']
4) df.ix[:]['d']
5) df.ix[:,'d']

Intuitively, I would prefer 2), maybe because I am used to the [row_indexer, column_indexer] style from numpy.
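As a quick sanity check, the sketch below confirms that options 1) to 3) return the same data. It assumes a recent pandas release; the two .ix variants are left out because .ix was deprecated in pandas 0.20 and removed in 1.0. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same shape of frame as in the question: 10 rows, columns a-d.
df = pd.DataFrame(data=np.random.random((10, 4)), columns=['a', 'b', 'c', 'd'])

by_brackets = df['d']          # 1) plain column lookup
by_loc = df.loc[:, 'd']        # 2) label-based [row_indexer, column_indexer]
by_chained = df.loc[:]['d']    # 3) select all rows first, then the column

# All three yield a Series with identical values and index.
print(by_brackets.equals(by_loc))
print(by_brackets.equals(by_chained))
```

So the choice between them is about style and speed, not about which values come back.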

suzanshakya · Accepted Answer

I would use IPython's magic function %timeit to find the best-performing method. The results are:

%timeit df['d']
100000 loops, best of 3: 5.35 µs per loop

%timeit df.loc[:,'d']
10000 loops, best of 3: 44.3 µs per loop

%timeit df.loc[:]['d']
100000 loops, best of 3: 12.4 µs per loop

%timeit df.ix[:]['d']
100000 loops, best of 3: 10.4 µs per loop

%timeit df.ix[:,'d']
10000 loops, best of 3: 53 µs per loop

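It turns out that the 1st method, df['d'], is considerably faster than the others: about 5 µs per loop, versus 10 to 12 µs for the chained forms and 44 to 53 µs for the [:, 'd'] forms.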
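Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. This is only a rough reproduction under assumptions: the helper name time_selectors and its number/repeat defaults are made up for illustration, the .ix variants are omitted because .ix no longer exists in pandas 1.0 and later, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.random((10, 4)), columns=['a', 'b', 'c', 'd'])

# Selection styles from the question that still exist in current pandas.
candidates = {
    "df['d']": lambda: df['d'],
    "df.loc[:, 'd']": lambda: df.loc[:, 'd'],
    "df.loc[:]['d']": lambda: df.loc[:]['d'],
}


def time_selectors(selectors, number=1000, repeat=3):
    """Return the best per-call time in microseconds for each selector."""
    results = {}
    for label, func in selectors.items():
        # timeit.repeat returns total seconds for `number` calls, per repeat.
        best = min(timeit.repeat(func, number=number, repeat=repeat))
        results[label] = best / number * 1e6
    return results


timings = time_selectors(candidates)
for label, micros in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{label:<16} {micros:8.2f} µs per call")
```

Taking the minimum over the repeats mirrors what %timeit's "best of 3" reports above.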
