jkokorian

Reputation: 3095

Setting values in pandas Series is slow, why?

Question

Does anyone know why setting an item directly on a pandas series is so incredibly slow? Am I doing something wrong, or is it just the way it is?

I ran a few tests to find the fastest way to set a value on a pandas Series object. Here are the results, ordered from fastest to slowest:

initialize array, set using integer index, create series

%%timeit
a = np.empty(1000, dtype='float')
for i in range(len(a)):
    a[i] = 1.0
s = pd.Series(data=a)

1000 loops, best of 3: 630 µs per loop

create empty list, add item using append, create series

%%timeit
l = []
for i in range(1000):
    l.append(1.0)
s = pd.Series(data=l)

1000 loops, best of 3: 1.05 ms per loop

initialize array, create series, set using set_value

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.set_value(i, 1.0)

100 loops, best of 3: 18.5 ms per loop

initialize array, create series, set using integer index

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s[i] = 1.0

10 loops, best of 3: 30.2 ms per loop

initialize array, create series, set using iat

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iat[i] = 1.0

10 loops, best of 3: 36.2 ms per loop

initialize array, create series, set using iloc

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iloc[i] = 1.0

1 loops, best of 3: 280 ms per loop
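For anyone who wants to rerun a few of these cases outside IPython, here is a rough sketch using the standard `timeit` module. The function names and the repeat counts are my own choices, not part of the original tests, and the absolute numbers will depend on the pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd

N = 1000  # same length as in the tests above


def fill_array_then_series():
    # Fill the NumPy array first, then wrap it in a Series.
    a = np.empty(N, dtype='float')
    for i in range(N):
        a[i] = 1.0
    return pd.Series(data=a)


def set_with_iat():
    # Create the Series first, then set each element through .iat.
    s = pd.Series(data=np.empty(N, dtype='float'))
    for i in range(N):
        s.iat[i] = 1.0
    return s


def set_with_iloc():
    # Create the Series first, then set each element through .iloc.
    s = pd.Series(data=np.empty(N, dtype='float'))
    for i in range(N):
        s.iloc[i] = 1.0
    return s


results = {}
for func in (fill_array_then_series, set_with_iat, set_with_iloc):
    # Best of 3 repeats, 5 calls each, reported as seconds per call.
    results[func.__name__] = min(timeit.repeat(func, number=5, repeat=3)) / 5

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```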

Upvotes: 9

Views: 10410

Answers (3)

jkokorian

Reputation: 3095

I figured out how to get past the indexing overhead when setting values on a series object directly:

a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    a[i] = 1.0

When initializing the Series from a numpy array, the data is not copied. If a reference is kept to the original array, you can just set values on that!
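A quick way to check whether the Series really shares the array's memory is `np.shares_memory`. The sketch below passes `copy=False` explicitly, on the assumption that the constructor's default copying behaviour may differ between pandas versions, so it is worth verifying on your own install.

```python
import numpy as np
import pandas as pd

a = np.empty(1000, dtype='float')

# Ask explicitly for no copy instead of relying on the default.
s = pd.Series(data=a, copy=False)

# True when the Series values and the original array use the same buffer.
shares = np.shares_memory(a, s.to_numpy())

# Write through the original array; the Series sees the change if the buffer is shared.
a[:] = 1.0

print(shares)
print(bool((s == 1.0).all()))
```

If the first value printed is False on your setup, the Series holds its own copy and writes to `a` will not show up in `s`.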

Upvotes: 2

Alexander

Reputation: 109626

I think these methods are even faster for initializing a series to a constant value:

Baseline

%%timeit
a = np.empty(1000, dtype='float')
for i in range(len(a)):
    a[i] = 1.0
s = pd.Series(data=a)

10000 loops, best of 3: 121 µs per loop

Alternatives

%%timeit
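# Note: np.empty returns uninitialized memory, so multiplying by 1. does not
# produce a constant series; this line only measures the cost of the operation.
s = pd.Series(np.empty(1000, dtype='float')) * 1.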

10000 loops, best of 3: 99.5 µs per loop

%%timeit
constant = 5.
s = pd.Series(np.ones(1000)) * constant

10000 loops, best of 3: 85.3 µs per loop
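Two other ways to build a constant series without the multiplication step, shown as a sketch with no timings attached since I have not measured them here. The names `s_full` and `s_scalar` are just illustrative.

```python
import numpy as np
import pandas as pd

constant = 5.0

# Let NumPy allocate and fill the array in one call.
s_full = pd.Series(np.full(1000, constant))

# Let pandas broadcast a scalar across an explicit index.
s_scalar = pd.Series(constant, index=range(1000))

print(s_full.head())
print(s_scalar.head())
```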

Upvotes: 1

EdChum

Reputation: 394159

From the docs:

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for.

So I get the following which should be comparable:

In [13]:

%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iat[i] = 1.0

10 loops, best of 3: 23.3 ms per loop

In [14]:

%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iloc[i] = 1.0

10 loops, best of 3: 159 ms per loop

For the other tests:

In [15]:

%%timeit
l = []
for i in range(1000):
    l.append(1.0)
s = pd.Series(data=l)

1000 loops, best of 3: 525 µs per loop

In [16]:

%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.set_value(i,1.0)

100 loops, best of 3: 10.1 ms per loop

In [17]:

%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s[i] = 1.0

100 loops, best of 3: 17.5 ms per loop
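Since the overhead is paid on every indexing call, the usual way around it is to make fewer calls. A minimal sketch: one slice assignment for the whole series, and the scalar accessors `.iat` and `.at` when a single element has to change. Note that `set_value` was deprecated and then removed in pandas 1.0, so `.at` and `.iat` are the accessors to use on current versions.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.empty(1000, dtype='float'))

# One slice assignment: the indexing machinery runs once, not 1000 times.
s.iloc[:] = 1.0

# Scalar accessors for single-element updates.
s.iat[0] = 2.0  # by integer position
s.at[1] = 3.0   # by label (with the default index, label 1 is position 1)

print(s.head(3))
```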

Upvotes: 5
