Reputation: 32558
All of the following seem to be working for iterating through the elements of a pandas Series. I'm sure there's more ways of doing it. What are the differences and which is the best way?
import pandas
arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
# 1
for el in arr:
    print(el)
# 2
for _, el in arr.iteritems():
    print(el)
# 3
for el in arr.array:
    print(el)
# 4
for el in arr.values:
    print(el)
# 5
for i in range(len(arr)):
    print(arr.iloc[i])
Upvotes: 14
Views: 25563
Reputation: 41427
Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.
However if Series iteration is absolutely necessary, performance will depend on the dtype and index:
| Index | Fastest if numpy dtype | Fastest if pandas dtype | Idiomatic |
|---|---|---|---|
| Unneeded | in s.to_numpy() | in s.array | in s |
| Default | in enumerate(s.to_numpy()) | in enumerate(s.array) | in s.items() |
| Custom | in zip(s.index, s.to_numpy()) | in s.items() | in s.items() |
s.to_numpy()
If the Series is a python or numpy dtype, it's usually fastest to iterate the underlying numpy ndarray:
for el in s.to_numpy(): # if dtype is datetime, int, float, str, string
To access the index, it's actually fastest to enumerate() or zip() the numpy ndarray:
for i, el in enumerate(s.to_numpy()): # if default range index
for i, el in zip(s.index, s.to_numpy()): # if custom index
Both are faster than the idiomatic s.items() / s.iteritems().
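A minimal runnable sketch of both index-access patterns (the Series here is a placeholder):
import pandas as pd
import numpy as np

s = pd.Series(np.arange(5))

# Default range index: positions equal labels, so enumerate() is enough
for i, el in enumerate(s.to_numpy()):
    print(i, el)

# Custom index: zip the real labels with the underlying ndarray
s.index = list('abcde')
for label, el in zip(s.index, s.to_numpy()):
    print(label, el)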
To micro-optimize, switch to s.tolist() for shorter int/float/str Series:
for el in s.to_numpy(): # if >100K elements
for el in s.tolist(): # to micro-optimize if <100K elements
Warning: Do not use list(s), as it doesn't use compiled code, which makes it slower.
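A rough way to check this trade-off on your own machine (a sketch; the 100K threshold and the exact numbers will vary by hardware and dtype):
import timeit
import pandas as pd
import numpy as np

s = pd.Series(np.random.randint(10, size=10_000))

for expr in ('s.to_numpy()', 's.tolist()', 'list(s)'):
    t = timeit.timeit(f'for el in {expr}: pass', globals=globals(), number=100)
    print(f'{expr:<12} {t:.4f}s')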
s.array or s.items()
Pandas extension dtypes contain extra (meta)data, e.g.:
| pandas dtype | contents |
|---|---|
| Categorical | 2 arrays |
| DatetimeTZ | array + timezone metadata |
| Interval | 2 arrays |
| Period | array + frequency metadata |
| ... | ... |
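As a quick illustration of the table above, a Categorical stores an integer codes array plus a categories index rather than the raw values (a minimal sketch):
import pandas as pd

s = pd.Series(list('abca')).astype('category')
print(s.array.codes)       # [0 1 2 0] -- the integer codes
print(s.array.categories)  # Index(['a', 'b', 'c'], dtype='object')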
Converting these extension arrays to numpy "may be expensive" since it could involve copying/coercing the data, so:
If the Series is a pandas extension dtype, it's generally fastest to iterate the underlying pandas array:
for el in s.array: # if dtype is pandas-only extension
For example, this holds with ~100 unique Categorical values.
To access the index, the idiomatic s.items() is very fast for pandas dtypes:
for i, el in s.items(): # if need index for pandas-only dtype
To micro-optimize, switch to the slightly faster enumerate() for default-indexed Categorical arrays:
for i, el in enumerate(s.array): # to micro-optimize Categorical dtype if need default range index
In short: use s.to_numpy() to get the underlying numpy ndarray, or s.array to get the underlying pandas array.
Avoid modifying the iterated Series:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!
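For instance, a minimal sketch of why in-loop writes don't stick:
import pandas as pd

s = pd.Series([1, 2, 3])
for el in s:
    el = 0             # rebinds the loop variable only; the Series itself is untouched
print(s.tolist())      # [1, 2, 3]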
Avoid iterating manually whenever possible by instead:
Vectorizing, (boolean) indexing, etc.
Applying functions, e.g. apply, agg, transform (a sketch follows below).
Note: These are not vectorizations despite the common misconception.
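A minimal sketch of the applying-functions route next to true vectorization (the lambda is an arbitrary stand-in):
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# apply/agg run a Python-level function -- convenient, but still a loop inside
print(s.apply(lambda x: x ** 2).tolist())  # [1, 4, 9, 16]
print(s.agg(['min', 'max']).tolist())      # [1, 4]

# True vectorization dispatches the whole operation to compiled code
print((s ** 2).tolist())                   # [1, 4, 9, 16]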
Specs: ThinkPad X1 Extreme Gen 3 (Core i7-10850H 2.70GHz, 32GB DDR4 2933MHz)
Versions: python==3.9.2, pandas==1.3.1, numpy==1.20.2
Testing data: Series generation code below (supplemental):
import pandas as pd
import numpy as np

n = 100_000  # sample size (not specified in the original post; pick as needed)

int_series = pd.Series(np.random.randint(1000000000, size=n))
float_series = pd.Series(np.random.randn(n))
floatnan_series = pd.Series(np.random.choice([np.nan, np.inf]*n + np.random.randn(n).tolist(), size=n))
str_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype(str)
string_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype('string')
datetime_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01'), size=n))
datetimetz_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01', tz='CET'), size=n))
categorical_series = pd.Series(np.random.randint(100, size=n)).astype('category')
interval_series = pd.Series(pd.arrays.IntervalArray.from_arrays(-np.random.random(size=n), np.random.random(size=n)))
period_series = pd.Series(pd.period_range(end='2021-01-01', periods=n, freq='s'))
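A harness along these lines could then reproduce the comparisons (a sketch; the statements and repetition count are arbitrary, and it assumes the series above have been generated):
import timeit

benchmarks = {
    'to_numpy': 'for el in int_series.to_numpy(): pass',
    'tolist': 'for el in int_series.tolist(): pass',
    'items': 'for i, el in int_series.items(): pass',
}
for name, stmt in benchmarks.items():
    print(name, timeit.timeit(stmt, globals=globals(), number=10))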
Upvotes: 40
Reputation: 8826
I believe it is more important to understand the requirement than to focus on cosmetics when looking for a solution.
In my opinion, iteration doesn't cost much unless the data is huge, where we have to be selective in our approach; for a small dataset, either approach will be fine, as shown below.
There are good explanations in PEP 469, PEP 3106 and Views And Iterators Instead Of Lists.
In Python 3, there is only one method named items(). It uses iterators, so it is fast, and it allows traversing the dictionary while editing. Note that the method iteritems() was removed in Python 3. (Pandas followed suit: Series.iteritems() was deprecated in favor of Series.items() and removed in pandas 2.0.)
One can have a look at Python3 Wiki Built-In_Changes to get more details on it.
arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
$ for index, value in arr.items():
      print(f"Index : {index}, Value : {value}")
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3
$ for index, value in arr.iteritems():
      print(f"Index : {index}, Value : {value}")
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3
$ for _, value in arr.iteritems():
      print(f"Index : {index}, Value : {value}")  # 'index' is stale from the previous loop, hence always 7
Index : 7, Value : 1
Index : 7, Value : 1
Index : 7, Value : 1
Index : 7, Value : 2
Index : 7, Value : 2
Index : 7, Value : 2
Index : 7, Value : 3
Index : 7, Value : 3
$ for i, v in enumerate(arr):
      print(f"Index : {i}, Value : {v}")
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3
$ for value in arr:
      print(value)
1
1
1
2
2
2
3
3
$ for value in arr.tolist():
      print(value)
1
1
1
2
2
2
3
3
There is a good post about How to iterate over rows in a DataFrame in Pandas; although it discusses df, it explains all about items(), iteritems(), etc.
Another good SO discussion covers items & iteritems.
Upvotes: 1
Reputation: 1944
For vector programming (pandas, R, Octave, ...), it is recommended not to iterate over vectors. Instead, use the library-provided mapping function to apply a function over a series or dataset.
In your case, applying the print function to each element is simply:
import pandas
arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
arr.apply(print)
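When the goal is a transformation rather than a side effect like printing, apply returns a new Series, e.g.:
import pandas
arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
squared = arr.apply(lambda x: x ** 2)
print(squared.tolist())  # [1, 1, 1, 4, 4, 4, 9, 9]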
Upvotes: 1
Reputation: 620
Ways to iterate through pandas/python
arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
#Using Python range() method
for i in range(len(arr)):
    print(arr[i])
range() doesn't include the end value in the sequence.
#List Comprehension
print([arr[i] for i in range(len(arr))])
List comprehensions work with any iterable, whether a list, string, or tuple.
#Using Python enumerate() method
for i, j in enumerate(arr):
    print(j)
#Using Python NumPy module
import numpy as np
print(np.arange(len(arr)))  # just the positional indices: [0 1 2 3 4 5 6 7]
for i, j in np.ndenumerate(arr):
    print(j)
enumerate() is very widely used, as it adds a counter to a list or any other iterable and returns it as an enumerate object. It reduces the overhead of keeping a count of the elements during iteration. A counter isn't required here. You could use np.ndenumerate() to mimic the behavior of enumerate for numpy arrays. For very large n-dimensional lists, it is advisable to use numpy.
You can also use a traditional for loop or a while loop:
x = 0
while x < len(arr):
    print(arr[x])
    x += 1
#Using lambda function
list(map(lambda x: x, arr))
lambda reduces the lines of code and can be used alongside filter, reduce, or map.
If you want to iterate through the rows of a dataframe rather than a series, you can use iterrows, itertuples, and iteritems. The best way in terms of memory and computation is to use the columns as vectors and perform vector computations on numpy arrays, as in the sketch below. Loops are very expensive when it comes to big data. It's easier and quicker to convert to numpy arrays and work on those.
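A sketch of that last point (the column names are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(100_000),
                   'b': np.random.randn(100_000)})

# Slow: a per-row Python loop
slow = [row.a + row.b for row in df.itertuples()]

# Fast: one vectorized operation on the underlying numpy arrays
fast = df['a'].to_numpy() + df['b'].to_numpy()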
Upvotes: 2
Reputation: 784
The test results are as follows: the plain loop is the slowest. iterrows() is optimized for the pandas dataframe and is a significant improvement over a direct loop. apply() also loops over rows, but it is much more efficient than iterrows() thanks to a series of internal optimizations, such as the use of iterators. Vectorization over numpy arrays is fastest, followed by vectorization over the pandas Series. Since vectorization works on the whole sequence at once, it saves the most time. Numpy uses precompiled C code under the hood and avoids much of the overhead of pandas Series operations, so numpy array operations are much faster than pandas Series operations. The timings (in seconds):
loop: 1.80301690102
iterrows: 0.724927186966
apply: 0.645957946777
pandas series: 0.333024024963
numpy array: 0.260366916656
Execution time, slowest to fastest: loop > iterrows > apply > pandas series > numpy array
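The original timing code isn't shown, but a comparison along these lines could be set up (a sketch; every statement here is an assumed stand-in, not the code that produced the numbers above):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randn(10_000)})

stmts = {
    'loop': 'sum(df["x"][i] for i in range(len(df)))',
    'iterrows': 'sum(row["x"] for _, row in df.iterrows())',
    'apply': 'df["x"].apply(lambda v: v).sum()',
    'pandas series': 'df["x"].sum()',
    'numpy array': 'df["x"].to_numpy().sum()',
}
for name, stmt in stmts.items():
    print(name, timeit.timeit(stmt, globals=globals(), number=10))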
Upvotes: 2
Reputation: 153510
Use items:
for i, v in arr.items():
    print(f'index: {i} and value: {v}')
Output:
index: 0 and value: 1
index: 1 and value: 1
index: 2 and value: 1
index: 3 and value: 2
index: 4 and value: 2
index: 5 and value: 2
index: 6 and value: 3
index: 7 and value: 3
Upvotes: 2