Reputation: 32558
All of the following seem to be working for iterating through the elements of a pandas Series. I'm sure there's more ways of doing it. What are the differences and which is the best way?
import pandas
arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
# 1
for el in arr:
    print(el)
# 2
for _, el in arr.iteritems():
    print(el)
# 3
for el in arr.array:
    print(el)
# 4
for el in arr.values:
    print(el)
# 5
for i in range(len(arr)):
    print(arr.iloc[i])
Upvotes: 14
Views: 25563
Reputation: 41427
Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.
However if Series iteration is absolutely necessary, performance will depend on the dtype and index:
| Index | Fastest if numpy dtype | Fastest if pandas dtype | Idiomatic |
|---|---|---|---|
| Unneeded | in s.to_numpy() | in s.array | in s |
| Default | in enumerate(s.to_numpy()) | in enumerate(s.array) | in s.items() |
| Custom | in zip(s.index, s.to_numpy()) | in s.items() | in s.items() |
s.to_numpy()
If the Series is a python or numpy dtype, it's usually fastest to iterate the underlying numpy ndarray:
for el in s.to_numpy(): # if dtype is datetime, int, float, str, string
To access the index, it's actually fastest to enumerate() or zip() the numpy ndarray:
for i, el in enumerate(s.to_numpy()): # if default range index
for i, el in zip(s.index, s.to_numpy()): # if custom index
Both are faster than the idiomatic s.items() / s.iteritems().
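A minimal runnable sketch of both index-access patterns (the Series here is a placeholder):
import pandas as pd
import numpy as np

s = pd.Series(np.arange(5))

# Default range index: positions equal labels, so enumerate() is enough
for i, el in enumerate(s.to_numpy()):
    print(i, el)

# Custom index: zip the real labels with the underlying ndarray
s.index = list('abcde')
for label, el in zip(s.index, s.to_numpy()):
    print(label, el)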
To micro-optimize, switch to s.tolist() for shorter int/float/str Series:
for el in s.to_numpy(): # if >100K elements
for el in s.tolist(): # to micro-optimize if <100K elements
Warning: Do not use list(s), as it doesn't use compiled code, which makes it slower.
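A rough way to check this trade-off on your own machine (a sketch; the 100K threshold and the exact numbers will vary by hardware and dtype):
import timeit
import pandas as pd
import numpy as np

s = pd.Series(np.random.randint(10, size=10_000))

for expr in ('s.to_numpy()', 's.tolist()', 'list(s)'):
    t = timeit.timeit(f'for el in {expr}: pass', globals=globals(), number=100)
    print(f'{expr:<12} {t:.4f}s')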
s.array or s.items()
Pandas extension dtypes contain extra (meta)data, e.g.:
| pandas dtype | contents |
|---|---|
| Categorical | 2 arrays |
| DatetimeTZ | array + timezone metadata |
| Interval | 2 arrays |
| Period | array + frequency metadata |
| ... | ... |
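As a quick illustration of the table above, a Categorical stores an integer codes array plus a categories index rather than the raw values (a minimal sketch):
import pandas as pd

s = pd.Series(list('abca')).astype('category')
print(s.array.codes)       # [0 1 2 0] -- the integer codes
print(s.array.categories)  # Index(['a', 'b', 'c'], dtype='object')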
Converting these extension arrays to numpy "may be expensive" since it could involve copying/coercing the data, so:
If the Series is a pandas extension dtype, it's generally fastest to iterate the underlying pandas array:
for el in s.array: # if dtype is pandas-only extension
For example, this holds with ~100 unique Categorical values.
To access the index, the idiomatic s.items() is very fast for pandas dtypes:
for i, el in s.items(): # if need index for pandas-only dtype
To micro-optimize, switch to the slightly faster enumerate() for default-indexed Categorical arrays:
for i, el in enumerate(s.array): # to micro-optimize Categorical dtype if need default range index
In short: use s.to_numpy() to get the underlying numpy ndarray, or s.array to get the underlying pandas array.
Avoid modifying the iterated Series:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!
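For instance, a minimal sketch of why in-loop writes don't stick:
import pandas as pd

s = pd.Series([1, 2, 3])
for el in s:
    el = 0             # rebinds the loop variable only; the Series itself is untouched
print(s.tolist())      # [1, 2, 3]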
Avoid iterating manually whenever possible by instead:
Vectorizing, (boolean) indexing, etc.
Applying functions, e.g. apply, agg, transform (a sketch follows below).
Note: These are not vectorizations despite the common misconception.
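A minimal sketch of the applying-functions route next to true vectorization (the lambda is an arbitrary stand-in):
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# apply/agg run a Python-level function -- convenient, but still a loop inside
print(s.apply(lambda x: x ** 2).tolist())  # [1, 4, 9, 16]
print(s.agg(['min', 'max']).tolist())      # [1, 4]

# True vectorization dispatches the whole operation to compiled code
print((s ** 2).tolist())                   # [1, 4, 9, 16]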
Specs: ThinkPad X1 Extreme Gen 3 (Core i7-10850H 2.70GHz, 32GB DDR4 2933MHz)
Versions: python==3.9.2, pandas==1.3.1, numpy==1.20.2
Testing data: Series generation code below (supplemental):
import pandas as pd
import numpy as np

n = 100_000  # sample size (not specified in the original post; pick as needed)

int_series = pd.Series(np.random.randint(1000000000, size=n))
float_series = pd.Series(np.random.randn(n))
floatnan_series = pd.Series(np.random.choice([np.nan, np.inf]*n + np.random.randn(n).tolist(), size=n))
str_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype(str)
string_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype('string')
datetime_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01'), size=n))
datetimetz_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01', tz='CET'), size=n))
categorical_series = pd.Series(np.random.randint(100, size=n)).astype('category')
interval_series = pd.Series(pd.arrays.IntervalArray.from_arrays(-np.random.random(size=n), np.random.random(size=n)))
period_series = pd.Series(pd.period_range(end='2021-01-01', periods=n, freq='s'))
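A harness along these lines could then reproduce the comparisons (a sketch; the statements and repetition count are arbitrary, and it assumes the series above have been generated):
import timeit

benchmarks = {
    'to_numpy': 'for el in int_series.to_numpy(): pass',
    'tolist': 'for el in int_series.tolist(): pass',
    'items': 'for i, el in int_series.items(): pass',
}
for name, stmt in benchmarks.items():
    print(name, timeit.timeit(stmt, globals=globals(), number=10))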
Upvotes: 40
Reputation: 8826
I believe it is more important to understand the requirement than to focus on cosmetics when looking for a solution.
In my opinion, iteration doesn't cost much unless the data is huge, where we have to be selective in our approach; for a small dataset, either approach will be fine, as shown below.
There are good explanations in PEP 469, PEP 3106 and Views And Iterators Instead Of Lists.
In Python 3, there is only one method named items(). It uses iterators, so it is fast, and it allows traversing the dictionary while editing. Note that the method iteritems() was removed in Python 3. (Pandas followed suit: Series.iteritems() was deprecated in favor of Series.items() and removed in pandas 2.0.)
One can have a look at Python3 Wiki Built-In_Changes to get more details on it.
arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
$ for index, value in arr.items():
      print(f"Index : {index}, Value : {value}")
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3
$ for index, value in arr.iteritems():
      print(f"Index : {index}, Value : {value}")
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3
$ for _, value in arr.iteritems():
      print(f"Index : {index}, Value : {value}")  # 'index' is stale from the previous loop, hence always 7
Index : 7, Value : 1
Index : 7, Value : 1
Index : 7, Value : 1
Index : 7, Value : 2
Index : 7, Value : 2
Index : 7, Value : 2
Index : 7, Value : 3
Index : 7, Value : 3
$ for i, v in enumerate(arr):
      print(f"Index : {i}, Value : {v}")
Index : 0, Value : 1
Index : 1, Value : 1
Index : 2, Value : 1
Index : 3, Value : 2
Index : 4, Value : 2
Index : 5, Value : 2
Index : 6, Value : 3
Index : 7, Value : 3
$ for value in arr:
      print(value)
1
1
1
2
2
2
3
3
$ for value in arr.tolist():
      print(value)
1
1
1
2
2
2
3
3
There is a good post about How to iterate over rows in a DataFrame in Pandas; although it discusses df, it explains all about items(), iteritems(), etc.
Another good SO discussion covers items & iteritems.
Upvotes: 1
Reputation: 1944
For vector programming (pandas, R, Octave, ...), it is recommended not to iterate over vectors. Instead, use the library-provided mapping function to apply a function over a series or dataset.
In your case, applying the print function to each element is simply:
import pandas
arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
arr.apply(print)
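When the goal is a transformation rather than a side effect like printing, apply returns a new Series, e.g.:
import pandas
arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
squared = arr.apply(lambda x: x ** 2)
print(squared.tolist())  # [1, 1, 1, 4, 4, 4, 9, 9]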
Upvotes: 1
Reputation: 620
Ways to iterate through pandas/python
arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])
#Using Python range() method
for i in range(len(arr)):
    print(arr[i])
range() doesn't include the end value in the sequence.
#List Comprehension
print([arr[i] for i in range(len(arr))])
List comprehensions work with any iterable, whether a list, string, or tuple.
#Using Python enumerate() method
for i, j in enumerate(arr):
    print(j)
#Using Python NumPy module
import numpy as np
print(np.arange(len(arr)))  # just the positional indices: [0 1 2 3 4 5 6 7]
for i, j in np.ndenumerate(arr):
    print(j)
enumerate() is very widely used, as it adds a counter to a list or any other iterable and returns it as an enumerate object. It reduces the overhead of keeping a count of the elements during iteration. A counter isn't required here. You could use np.ndenumerate() to mimic the behavior of enumerate for numpy arrays. For very large n-dimensional lists, it is advisable to use numpy.
You can also use a traditional for loop or a while loop:
x = 0
while x < len(arr):
    print(arr[x])
    x += 1
#Using lambda function
list(map(lambda x: x, arr))
lambda reduces the lines of code and can be used alongside filter, reduce, or map.
If you want to iterate through the rows of a dataframe rather than a series, you can use iterrows, itertuples, and iteritems. The best way in terms of memory and computation is to use the columns as vectors and perform vector computations on numpy arrays, as in the sketch below. Loops are very expensive when it comes to big data. It's easier and quicker to convert to numpy arrays and work on those.
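A sketch of that last point (the column names are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(100_000),
                   'b': np.random.randn(100_000)})

# Slow: a per-row Python loop
slow = [row.a + row.b for row in df.itertuples()]

# Fast: one vectorized operation on the underlying numpy arrays
fast = df['a'].to_numpy() + df['b'].to_numpy()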
Upvotes: 2
Reputation: 784
The test results are as follows: the plain loop is the slowest. iterrows() is optimized for the pandas dataframe and is a significant improvement over a direct loop. apply() also loops over rows, but it is much more efficient than iterrows() thanks to a series of internal optimizations, such as the use of iterators. Vectorization over numpy arrays is fastest, followed by vectorization over the pandas Series. Since vectorization works on the whole sequence at once, it saves the most time. Numpy uses precompiled C code under the hood and avoids much of the overhead of pandas Series operations, so numpy array operations are much faster than pandas Series operations. The timings (in seconds):
loop: 1.80301690102
iterrows: 0.724927186966
apply: 0.645957946777
pandas series: 0.333024024963
numpy array: 0.260366916656
Execution time, slowest to fastest: loop > iterrows > apply > pandas series > numpy array
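The original timing code isn't shown, but a comparison along these lines could be set up (a sketch; every statement here is an assumed stand-in, not the code that produced the numbers above):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randn(10_000)})

stmts = {
    'loop': 'sum(df["x"][i] for i in range(len(df)))',
    'iterrows': 'sum(row["x"] for _, row in df.iterrows())',
    'apply': 'df["x"].apply(lambda v: v).sum()',
    'pandas series': 'df["x"].sum()',
    'numpy array': 'df["x"].to_numpy().sum()',
}
for name, stmt in stmts.items():
    print(name, timeit.timeit(stmt, globals=globals(), number=10))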
Upvotes: 2
Reputation: 153510
Use items:
for i, v in arr.items():
    print(f'index: {i} and value: {v}')
Output:
index: 0 and value: 1
index: 1 and value: 1
index: 2 and value: 1
index: 3 and value: 2
index: 4 and value: 2
index: 5 and value: 2
index: 6 and value: 3
index: 7 and value: 3
Upvotes: 2