Reputation: 131228

How can I iterate over rows in a Pandas DataFrame?

I have a pandas dataframe, df:

How do I iterate over the rows of this dataframe? For every row, I want to access its elements (values in cells) by the name of the columns. For example:

for row in df.rows:
    print(row['c1'], row['c2'])

I found a similar question, which suggests using either of these:

```
for date, row in df.T.iteritems():
```
```
for row in df.iterrows():
```

But I do not understand what the row object is and how I can work with it.

Upvotes: 4210

Answers (30)

Gabriel Staples

Reputation: 53055

Key takeaways:

Use vectorization.
Speed profile your code! Don't assume something is faster because you think it is faster; speed profile it and prove it is faster. The results may surprise you. To get high-precision timestamps in Python, see my answer here: High-precision clock in Python.
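As a minimal sketch of what "speed profile it" can look like, the snippet below times two ways of summing a column with the standard library's `time.perf_counter()`. The helper name `time_it` and the sample data are illustrative assumptions; they are not part of the test code described later in this answer.

```python
import time

import numpy as np
import pandas as pd


def time_it(func, repeats=3):
    """Return the best wall-clock time, in seconds, over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical sample data: one column of 100,000 random integers
df = pd.DataFrame({"A": np.random.randint(-1000, 1000, size=100_000)})

t_loop = time_it(lambda: sum(x for x in df["A"]))  # Python-level loop
t_vec = time_it(lambda: df["A"].sum())             # vectorized reduction

print(f"Python loop: {t_loop:.6f} s")
print(f"Vectorized:  {t_vec:.6f} s")
```

Taking the best of several runs reduces noise from other processes; the actual numbers depend on your machine and data.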

How to iterate over Pandas `DataFrame`s without iterating

After several weeks of working on this answer, here's what I've come up with:

Here are 13 techniques for iterating over Pandas DataFrames. As you can see, the time it takes varies dramatically. The fastest technique is ~1363x faster than the slowest technique! The key takeaway, as @cs95 says here, is don't iterate! Use vectorization ("array programming") instead. All this really means is that you should use the arrays directly in mathematical formulas rather than trying to manually iterate over the arrays. The underlying objects must support this, of course, but both Numpy and Pandas do.

There are many ways to use vectorization in Pandas, which you can see in the plot and in my example code below. When using the arrays directly, the underlying looping still takes place, but in (I think) very optimized underlying C code rather than through raw Python.
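To make "use the arrays directly" concrete, here is a tiny sketch that computes the same made-up formula (`2 * A + B`) once with a manual loop and once with vectorization. The column values are arbitrary illustration data.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Manual iteration: one Python-level operation per row
looped = []
for a, b in zip(df["A"], df["B"]):
    looped.append(2 * a + b)

# Vectorization: use the whole columns directly in the formula
vectorized = 2 * df["A"] + df["B"]

print(looped)               # [12, 24, 36]
print(vectorized.tolist())  # [12, 24, 36]
```

Both produce the same values; the difference is only where the looping happens.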

Results

13 techniques, numbered 1 to 13, were tested. The technique number and name are underneath each bar. The total calculation time is above each bar. Underneath that is the multiplier showing how much longer it took than the fastest technique, at the far right:

_{From pandas_dataframe_iteration_vs_vectorization_vs_list_comprehension_speed_tests.svg in my eRCaGuy_hello_world repo (produced by this code).}

Summary

List comprehension and vectorization (possibly with boolean indexing) are all you really need.

Use list comprehension (good) and vectorization (best). Pure vectorization I think is always possible, but may take extra work in complicated calculations. Search this answer for "boolean indexing", "boolean array", and "boolean mask" (all three are the same thing) to see some of the more complicated cases where pure vectorization can thereby be used.

Here are the 13 techniques, listed in order of fastest first to slowest last. I recommend never using the last (slowest) 3 to 4 techniques.

Technique 8: 8_pure_vectorization__with_df.loc[]_boolean_array_indexing_for_if_statment_corner_case
Technique 6: 6_vectorization__with_apply_for_if_statement_corner_case
Technique 7: 7_vectorization__with_list_comprehension_for_if_statment_corner_case
Technique 11: 11_list_comprehension_w_zip_and_direct_variable_assignment_calculated_in_place
Technique 10: 10_list_comprehension_w_zip_and_direct_variable_assignment_passed_to_func
Technique 12: 12_list_comprehension_w_zip_and_row_tuple_passed_to_func
Technique 5: 5_itertuples_in_for_loop
Technique 13: 13_list_comprehension_w__to_numpy__and_direct_variable_assignment_passed_to_func
Technique 9: 9_apply_function_with_lambda
Technique 1: 1_raw_for_loop_using_regular_df_indexing
Technique 2: 2_raw_for_loop_using_df.loc[]_indexing
Technique 4: 4_iterrows_in_for_loop
Technique 3: 3_raw_for_loop_using_df.iloc[]_indexing

Rules of thumb:

Techniques 3, 4, and 2 should never be used. They are super slow and have no advantages whatsoever. Keep in mind though: it's not the indexing technique, such as .loc[] or .iloc[] that makes these techniques bad, but rather, it's the for loop they are in that makes them bad! I use .loc[] inside the fastest (pure vectorization) approach, for instance! So, here are the 3 slowest techniques which should never be used:
1. 3_raw_for_loop_using_df.iloc[]_indexing
2. 4_iterrows_in_for_loop
3. 2_raw_for_loop_using_df.loc[]_indexing
Technique 1_raw_for_loop_using_regular_df_indexing should never be used either, but if you're going to use a raw for loop, it's faster than the others.
The .apply() function (9_apply_function_with_lambda) is ok, but generally speaking, I'd avoid it too. Technique 6_vectorization__with_apply_for_if_statement_corner_case did perform better than 7_vectorization__with_list_comprehension_for_if_statment_corner_case, however, which is interesting.
List comprehension is great! It's not the fastest, but it is easy to use and very fast!
1. The nice thing about it is that it can be used with any function that is intended to work on individual values, or array values. And this means you could have really complicated if statements and things inside the function. So, the tradeoff here is that it gives you great versatility with really readable and re-usable code by using external calculation functions, while still giving you great speed!
Vectorization is the fastest and best, and what you should use whenever the equation is simple. You can optionally use something like .apply() or list comprehension on just the more-complicated portions of the equation, while still easily using vectorization for the rest.
Pure vectorization is the absolute fastest and best, and what you should use if you are willing to put in the effort to make it work.
1. For simple cases, it's what you should use.
2. For complicated cases, if statements, etc., pure vectorization can be made to work too, through boolean indexing, but can add extra work and can decrease readability to do so. So, you can optionally use list comprehension (usually the best) or .apply() (generally slower, but not always) for just those edge cases instead, while still using vectorization for the rest of the calculation. Ex: see techniques 7_vectorization__with_list_comprehension_for_if_statment_corner_case and 6_vectorization__with_apply_for_if_statement_corner_case.

The test data

Assume we have the following Pandas DataFrame. It has 2 million rows with 4 columns (A, B, C, and D), each with random integer values from -1000 to 999 (the upper bound passed to np.random.randint is exclusive):

df =
           A    B    C    D
0       -365  842  284 -942
1        532  416 -102  888
2        397  321 -296 -616
3       -215  879  557  895
4        857  701 -157  480
...      ...  ...  ...  ...
1999995 -101 -233 -377 -939
1999996 -989  380  917  145
1999997 -879  333 -372 -970
1999998  738  982 -743  312
1999999 -306 -103  459  745

I produced this DataFrame like this:

import numpy as np
import pandas as pd

# Create an array (numpy list of lists) of fake data
MIN_VAL = -1000
MAX_VAL = 1000
# NUM_ROWS = 10_000_000
NUM_ROWS = 2_000_000  # default for final tests
# NUM_ROWS = 1_000_000
# NUM_ROWS = 100_000
# NUM_ROWS = 10_000  # default for rapid development & initial tests
NUM_COLS = 4
data = np.random.randint(MIN_VAL, MAX_VAL, size=(NUM_ROWS, NUM_COLS))

# Now convert it to a Pandas DataFrame with columns named "A", "B", "C", and "D"
df_original = pd.DataFrame(data, columns=["A", "B", "C", "D"])
print(f"df = \n{df_original}")

The test equation/calculation

I wanted to demonstrate that all of these techniques are possible on non-trivial functions or equations, so I intentionally made the equation they are calculating require:

if statements
data from multiple columns in the DataFrame
data from multiple rows in the DataFrame

The equation we will be calculating for each row is this. I arbitrarily made it up, but I think it contains enough complexity that you will be able to expand on what I've done to perform any equation you want in Pandas with full vectorization:
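Reconstructed from the Python code in this section (so treat the notation, including the helper name $f$, as an assumption rather than the original rendering), the calculation for row $i$ is:

```latex
\mathrm{val}_i = 2A_{i-2} + 3A_{i-1} + 4A_i + 5A_{i+1} + f(B_i) + 7C_i - 8D_i,
\qquad
f(B_i) =
\begin{cases}
6B_i,  & B_i > 0 \\
60B_i, & \text{otherwise}
\end{cases}
```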

In Python, the above equation can be written like this:

# Calculate and return a new value, `val`, by performing the following equation:
val = (
    2 * A_i_minus_2
    + 3 * A_i_minus_1
    + 4 * A
    + 5 * A_i_plus_1
    # Python ternary operator; don't forget parentheses around the entire
    # ternary expression!
    + ((6 * B) if B > 0 else (60 * B))
    + 7 * C
    - 8 * D
)

Alternatively, you could write it like this:

# Calculate and return a new value, `val`, by performing the following equation:

if B > 0:
    B_new = 6 * B
else:
    B_new = 60 * B

val = (
    2 * A_i_minus_2
    + 3 * A_i_minus_1
    + 4 * A
    + 5 * A_i_plus_1
    + B_new
    + 7 * C
    - 8 * D
)

Either of those can be wrapped into a function. Ex:

def calculate_val(
        A_i_minus_2,
        A_i_minus_1,
        A,
        A_i_plus_1,
        B,
        C,
        D):
    val = (
        2 * A_i_minus_2
        + 3 * A_i_minus_1
        + 4 * A
        + 5 * A_i_plus_1
        # Python ternary operator; don't forget parentheses around the
        # entire ternary expression!
        + ((6 * B) if B > 0 else (60 * B))
        + 7 * C
        - 8 * D
    )
    return val

The techniques

The full code is available to download and run in my python/pandas_dataframe_iteration_vs_vectorization_vs_list_comprehension_speed_tests.py file in my eRCaGuy_hello_world repo.

Here is the code for all 13 techniques:

Technique 1: 1_raw_for_loop_using_regular_df_indexing

val = [np.nan]*len(df)
for i in range(len(df)):
    if i < 2 or i > len(df)-2:
        continue

    val[i] = calculate_val(
        df["A"][i-2],
        df["A"][i-1],
        df["A"][i],
        df["A"][i+1],
        df["B"][i],
        df["C"][i],
        df["D"][i],
    )
df["val"] = val  # put this column back into the dataframe

Technique 2: 2_raw_for_loop_using_df.loc[]_indexing

val = [np.nan]*len(df)
for i in range(len(df)):
    if i < 2 or i > len(df)-2:
        continue

    val[i] = calculate_val(
        df.loc[i-2, "A"],
        df.loc[i-1, "A"],
        df.loc[i,   "A"],
        df.loc[i+1, "A"],
        df.loc[i,   "B"],
        df.loc[i,   "C"],
        df.loc[i,   "D"],
    )

df["val"] = val  # put this column back into the dataframe

Technique 3: 3_raw_for_loop_using_df.iloc[]_indexing

# column indices
i_A = 0
i_B = 1
i_C = 2
i_D = 3

val = [np.nan]*len(df)
for i in range(len(df)):
    if i < 2 or i > len(df)-2:
        continue

    val[i] = calculate_val(
        df.iloc[i-2, i_A],
        df.iloc[i-1, i_A],
        df.iloc[i,   i_A],
        df.iloc[i+1, i_A],
        df.iloc[i,   i_B],
        df.iloc[i,   i_C],
        df.iloc[i,   i_D],
    )

df["val"] = val  # put this column back into the dataframe

Technique 4: 4_iterrows_in_for_loop

val = [np.nan]*len(df)
for index, row in df.iterrows():
    if index < 2 or index > len(df)-2:
        continue

    val[index] = calculate_val(
        df["A"][index-2],
        df["A"][index-1],
        row["A"],
        df["A"][index+1],
        row["B"],
        row["C"],
        row["D"],
    )

df["val"] = val  # put this column back into the dataframe

For all of the next examples, we must first prepare the dataframe by adding columns with previous and next values: A_(i-2), A_(i-1), and A_(i+1). These columns in the DataFrame will be named A_i_minus_2, A_i_minus_1, and A_i_plus_1, respectively:

df_original["A_i_minus_2"] = df_original["A"].shift(2)  # val at index i-2
df_original["A_i_minus_1"] = df_original["A"].shift(1)  # val at index i-1
df_original["A_i_plus_1"] = df_original["A"].shift(-1)  # val at index i+1

# Note: to ensure that no partial calculations are ever done with rows which
# have NaN values due to the shifting, we can either drop such rows with
# `.dropna()`, or set all values in these rows to NaN. I'll choose the latter
# so that the stats that will be generated with the techniques below will end
# up matching the stats which were produced by the prior techniques above. ie:
# the number of rows will be identical to before.
#
# df_original = df_original.dropna()
df_original.iloc[:2, :] = np.nan   # slicing operators: first two rows,
                                   # all columns
df_original.iloc[-1:, :] = np.nan  # slicing operators: last row, all columns

Running the vectorized code just above to produce those 3 new columns took a total of 0.044961 seconds.
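If `.shift()` is unfamiliar, this small sketch shows what the three new columns contain, using a 5-row series of made-up values instead of the 2-million-row test data:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], name="A")

demo = pd.DataFrame({
    "A": s,
    "A_i_minus_1": s.shift(1),   # value from the previous row
    "A_i_plus_1": s.shift(-1),   # value from the next row
})
print(demo)

# Rows with no neighbor to borrow from hold NaN
print(demo["A_i_minus_1"].isna().tolist())  # [True, False, False, False, False]
print(demo["A_i_plus_1"].isna().tolist())   # [False, False, False, False, True]
```

This is why the first two rows and the last row of the test DataFrame cannot produce a valid `val`.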

Now on to the rest of the techniques:

Technique 5: 5_itertuples_in_for_loop

# pre-allocate a `val` array of the appropriate size
val = [np.nan]*len(df)
# Now iterate over all rows in the dataframe, and populate `val`
for row in df.itertuples():
    val[row.Index] = calculate_val(
        row.A_i_minus_2,
        row.A_i_minus_1,
        row.A,
        row.A_i_plus_1,
        row.B,
        row.C,
        row.D,
    )

df["val"] = val  # put this column back into the dataframe

Technique 6: 6_vectorization__with_apply_for_if_statement_corner_case

def calculate_new_column_b_value(b_value):
    # Python ternary operator
    b_value_new = (6 * b_value) if b_value > 0 else (60 * b_value)
    return b_value_new

# In this particular example, since we have an embedded `if-else` statement
# for the `B` column, pure vectorization is less intuitive. So, first we'll
# calculate a new `B` column using
# **`apply()`**, then we'll use vectorization for the rest.
df["B_new"] = df["B"].apply(calculate_new_column_b_value)
# OR (same thing, but with a lambda function instead)
# df["B_new"] = df["B"].apply(lambda x: (6 * x) if x > 0 else (60 * x))

# Now we can use vectorization for the rest. "Vectorization" in this case
# means to simply use the column series variables in equations directly,
# without manually iterating over them. Pandas DataFrames will handle the
# underlying iteration automatically for you. You just focus on the math.
df["val"] = (
    2 * df["A_i_minus_2"]
    + 3 * df["A_i_minus_1"]
    + 4 * df["A"]
    + 5 * df["A_i_plus_1"]
    + df["B_new"]
    + 7 * df["C"]
    - 8 * df["D"]
)

Technique 7: 7_vectorization__with_list_comprehension_for_if_statment_corner_case

# In this particular example, since we have an embedded `if-else` statement
# for the `B` column, pure vectorization is less intuitive. So, first we'll
# calculate a new `B` column using **list comprehension**, then we'll use
# vectorization for the rest.
df["B_new"] = [
    calculate_new_column_b_value(b_value) for b_value in df["B"]
]

# Now we can use vectorization for the rest. "Vectorization" in this case
# means to simply use the column series variables in equations directly,
# without manually iterating over them. Pandas DataFrames will handle the
# underlying iteration automatically for you. You just focus on the math.
df["val"] = (
    2 * df["A_i_minus_2"]
    + 3 * df["A_i_minus_1"]
    + 4 * df["A"]
    + 5 * df["A_i_plus_1"]
    + df["B_new"]
    + 7 * df["C"]
    - 8 * df["D"]
)

Technique 8: 8_pure_vectorization__with_df.loc[]_boolean_array_indexing_for_if_statment_corner_case

This uses boolean indexing, AKA: a boolean mask, to accomplish the equivalent of the if statement in the equation. In this way, pure vectorization can be used for the entire equation, thereby maximizing performance and speed.

# If statement to evaluate:
#
#     if B > 0:
#         B_new = 6 * B
#     else:
#         B_new = 60 * B
#
# In this particular example, since we have an embedded `if-else` statement
# for the `B` column, we can use some boolean array indexing through
# `df.loc[]` for some pure vectorization magic.
#
# Explanation:
#
# Long:
#
# The format is: `df.loc[rows, columns]`, except in this case, the rows are
# specified by a "boolean array" (AKA: a boolean expression, list of
# booleans, or "boolean mask"), specifying all rows where `B` is > 0. Then,
# only in that `B` column for those rows, set the value accordingly. After
# we do this for where `B` is > 0, we do the same thing for where `B`
# is <= 0, except with the other equation.
#
# Short:
#
# For all rows where the boolean expression applies, set the column value
# accordingly.
#
# GitHub CoPilot first showed me this `.loc[]` technique.
# See also the official documentation:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
#
# ===========================
# 1st: handle the > 0 case
# ===========================
df["B_new"] = df.loc[df["B"] > 0, "B"] * 6
#
# ===========================
# 2nd: handle the <= 0 case, merging the results into the
# previously-created "B_new" column
# ===========================
# - NB: this does NOT work; it overwrites and replaces the whole "B_new"
#   column instead:
#
#       df["B_new"] = df.loc[df["B"] <= 0, "B"] * 60
#
# This works:
df.loc[df["B"] <= 0, "B_new"] = df.loc[df["B"] <= 0, "B"] * 60

# Now use normal vectorization for the rest.
df["val"] = (
    2 * df["A_i_minus_2"]
    + 3 * df["A_i_minus_1"]
    + 4 * df["A"]
    + 5 * df["A_i_plus_1"]
    + df["B_new"]
    + 7 * df["C"]
    - 8 * df["D"]
)
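A commonly used alternative for this kind of if/else is `np.where`. It was not one of the 13 techniques timed here, so the sketch below only checks that it gives the same `B_new` values as the two-step `.loc[]` approach on a few made-up numbers; I make no claim about its relative speed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [5, -3, 0, 12, -7]})

# Boolean-mask approach, as in technique 8
df["B_new"] = df.loc[df["B"] > 0, "B"] * 6
df.loc[df["B"] <= 0, "B_new"] = df.loc[df["B"] <= 0, "B"] * 60

# np.where(condition, value_if_true, value_if_false), evaluated elementwise
df["B_new_where"] = np.where(df["B"] > 0, 6 * df["B"], 60 * df["B"])

print(df)
print((df["B_new"] == df["B_new_where"]).all())  # True
```

Note that the `.loc[]` version yields a float column (it passes through NaN on the way), while `np.where` keeps the integer dtype here.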

Technique 9: 9_apply_function_with_lambda

df["val"] = df.apply(
    lambda row: calculate_val(
        row["A_i_minus_2"],
        row["A_i_minus_1"],
        row["A"],
        row["A_i_plus_1"],
        row["B"],
        row["C"],
        row["D"]
    ),
    axis='columns' # same as `axis=1`: "apply function to each row",
                   # rather than to each column
)

Technique 10: 10_list_comprehension_w_zip_and_direct_variable_assignment_passed_to_func

df["val"] = [
    # Note: you *could* do the calculations directly here instead of using a
    # function call, so long as you don't have indented code blocks such as
    # sub-routines or multi-line if statements.
    #
    # I'm using a function call.
    calculate_val(
        A_i_minus_2,
        A_i_minus_1,
        A,
        A_i_plus_1,
        B,
        C,
        D
    ) for A_i_minus_2, A_i_minus_1, A, A_i_plus_1, B, C, D
    in zip(
        df["A_i_minus_2"],
        df["A_i_minus_1"],
        df["A"],
        df["A_i_plus_1"],
        df["B"],
        df["C"],
        df["D"]
    )
]

Technique 11: 11_list_comprehension_w_zip_and_direct_variable_assignment_calculated_in_place

df["val"] = [
    2 * A_i_minus_2
    + 3 * A_i_minus_1
    + 4 * A
    + 5 * A_i_plus_1
    # Python ternary operator; don't forget parentheses around the entire
    # ternary expression!
    + ((6 * B) if B > 0 else (60 * B))
    + 7 * C
    - 8 * D
    for A_i_minus_2, A_i_minus_1, A, A_i_plus_1, B, C, D
    in zip(
        df["A_i_minus_2"],
        df["A_i_minus_1"],
        df["A"],
        df["A_i_plus_1"],
        df["B"],
        df["C"],
        df["D"]
    )
]

Technique 12: 12_list_comprehension_w_zip_and_row_tuple_passed_to_func

df["val"] = [
    calculate_val(
        row[0],
        row[1],
        row[2],
        row[3],
        row[4],
        row[5],
        row[6],
    ) for row
    in zip(
        df["A_i_minus_2"],
        df["A_i_minus_1"],
        df["A"],
        df["A_i_plus_1"],
        df["B"],
        df["C"],
        df["D"]
    )
]

Technique 13: 13_list_comprehension_w__to_numpy__and_direct_variable_assignment_passed_to_func

df["val"] = [
    # Note: you *could* do the calculations directly here instead of using a
    # function call, so long as you don't have indented code blocks such as
    # sub-routines or multi-line if statements.
    #
    # I'm using a function call.
    calculate_val(
        A_i_minus_2,
        A_i_minus_1,
        A,
        A_i_plus_1,
        B,
        C,
        D
    ) for A_i_minus_2, A_i_minus_1, A, A_i_plus_1, B, C, D
        # Note: this `[[...]]` double-bracket indexing is used to select a
        # subset of columns from the dataframe. The inner `[]` brackets
        # create a list from the column names within them, and the outer
        # `[]` brackets accept this list to index into the dataframe and
        # select just this list of columns, in that order.
        # - See the official documentation on it here:
        #   https://pandas.pydata.org/docs/user_guide/indexing.html#basics
        #   - Search for the phrase "You can pass a list of columns to [] to
        #     select columns in that order."
        #   - I learned this from this comment here:
        #     https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#comment136020567_55557758
        # - One of the **list comprehension** examples in this answer here
        #   uses `.to_numpy()` like this:
        #   https://stackoverflow.com/a/55557758/4561887
    in df[[
        "A_i_minus_2",
        "A_i_minus_1",
        "A",
        "A_i_plus_1",
        "B",
        "C",
        "D"
    ]].to_numpy()  # NB: `.values` works here too, but the pandas docs
                   # recommend `.to_numpy()` instead. See:
                   # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html
]

Here are the results again:

Using the pre-shifted rows in the 4 `for` loop techniques as well

I wanted to see if removing this if check and using the pre-shifted rows in the 4 for loop techniques would have much effect:

if i < 2 or i > len(df)-2:
    continue

...so I created this file with those modifications: pandas_dataframe_iteration_vs_vectorization_vs_list_comprehension_speed_tests_mod.py. Search the file for "MOD:" to find the 4 new, modified techniques.

It had only a slight improvement. Here are the results of these 17 techniques now, with the 4 new ones having the word _MOD_ near the beginning of their name, just after their number. This is over 500k rows this time, not 2M:

More on `.itertuples()`

There are actually more nuances when using .itertuples(). To delve into some of those, read this answer by @Romain Capron. Here is a bar chart plot I made of his results:

My plotting code for his results is in python/pandas_plot_bar_chart_better_GREAT_AUTOLABEL_DATA.py in my eRCaGuy_hello_world repo.

Future work

Using Cython (Python compiled into C code), or just raw C functions called by Python, could be faster potentially, but I'm not going to do that for these tests. I'd only look into and speed test those options for big optimizations.

I currently don't know Cython and don't feel the need to learn it. As you can see above, simply using pure vectorization properly already runs incredibly fast, processing 2 million rows in only 0.1 seconds, or 20 million rows per second.

References

A bunch of the official Pandas documentation, especially the DataFrame documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html.
This excellent answer by @cs95 - this is where I learned in particular how to use list comprehension to iterate over a DataFrame.
This answer about itertuples(), by @Romain Capron - I studied it carefully and edited/formatted it.
All of this is my own code, but I want to point out that I had dozens of chats with GitHub Copilot (mostly), Bing AI, and ChatGPT in order to figure out many of these techniques and debug my code as I went.

Bing Chat produced the pretty LaTeX equation for me, with the following prompt. Of course, I verified the output:

Convert this Python code to a pretty equation I can paste onto Stack Overflow:

    val = (
        2 * A_i_minus_2
        + 3 * A_i_minus_1
        + 4 * A
        + 5 * A_i_plus_1
        # Python ternary operator; don't forget parentheses around the entire ternary expression!
        + ((6 * B) if B > 0 else (60 * B))
        + 7 * C
        - 8 * D
    )

Bonus

You can also modify the value of cells with df.at[row, column] = newValue.

for row in range(len(df)):
    df.at[row, 'c1'] = 'data-' + str(df.at[row, 'c1'])
    print(df.at[row, 'c1'], df.at[row, 'c2'])

The output will be:

data-10 100
data-11 110
data-12 120

Upvotes: 17

cs95

Reputation: 402942

How to iterate over rows in a DataFrame in Pandas

Answer: DON'T^*!

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().
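For instance, with a small made-up frame (the column names `c1` and `c2` are borrowed from the question; the values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"c1": [10, 11, 12], "c2": [100, 110, 120]})

# Render the whole frame as one string instead of looping over rows to print
text = df.to_string(index=False)
print(text)
```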

Do you want to compute something? In that case, search for methods in this order (list modified from here):

Vectorization
Cython routines
List Comprehensions (vanilla for loop)
DataFrame.apply(): i) Reductions that can be performed in Cython, ii) Iteration in Python space
items() ~~iteritems()~~ ^{(deprecated since v1.5.0)}
DataFrame.itertuples()
DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

_{* It's actually a little more complicated than "don't". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you're not sure whether you need an iterative solution, you probably don't. PS: To know more about my rationale for writing this answer, skip to the very bottom.}

Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.
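As a small illustration of the kinds of vectorised operations listed above, here is a sketch on a made-up frame; the column names and values are assumptions chosen for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["x", "y", "x", "y"],
    "A": [1, 2, 3, 4],
    "B": [10, 20, 30, 40],
})

total = df["A"] + df["B"]                    # arithmetic
is_large = df["B"] > 15                      # comparison
mean_a = df["A"].mean()                      # reduction
per_group = df.groupby("group")["B"].sum()   # groupby

print(total.tolist())      # [11, 22, 33, 44]
print(is_large.tolist())   # [False, True, True, True]
print(mean_a)              # 2.5
print(per_group.tolist())  # [40, 60]
```

None of these lines contains an explicit loop; pandas handles the iteration internally.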

Next Best Thing: List Comprehensions^*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]

# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]

# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].to_numpy()]

# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.
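Here is a runnable sketch of that pattern. The function `label`, its thresholds, and the column names are hypothetical stand-ins for whatever your business logic is:

```python
import pandas as pd


def label(price, qty):
    """Hypothetical business rule: classify an order by its total value."""
    total = price * qty
    if total >= 100:
        return "large"
    if total >= 20:
        return "medium"
    return "small"


df = pd.DataFrame({"price": [5.0, 12.5, 40.0], "qty": [2, 4, 3]})

# One Python-level call per row, with the columns zipped together
df["size"] = [label(p, q) for p, q in zip(df["price"], df["qty"])]

print(df["size"].tolist())  # ['small', 'medium', 'large']
```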

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don't have NaNs, but this cannot always be guaranteed.

The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
When dealing with mixed data types you should iterate over zip(df['A'], df['B'], ...) instead of df[['A', 'B']].to_numpy() as the latter implicitly upcasts data to the most common type. As an example, if A is an integer column and B is a float column, to_numpy() will convert the integers to floats, and if A is numeric and B is string, you get a generic object array, which may not be what you want. Fortunately zipping your columns together is the most straightforward workaround to this.
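The upcasting caveat can be seen in a short sketch. The frame below is made up, with one integer column and one float column:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [0.5, 1.5, 2.5]})

# to_numpy() needs one dtype for the whole array, so the ints become floats
print(df[["A", "B"]].to_numpy().dtype)  # float64

# zip keeps each column's own type
first_a, first_b = next(zip(df["A"], df["B"]))
print(type(first_a).__name__, type(first_b).__name__)  # int float
```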

_{*Your mileage may vary for the reasons outlined in the Caveats section above.}

An Obvious Example

Let's demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).
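To show the difference in style only, here is a hedged sketch of what functions like these might look like. The names `vec` and `vec_numpy` come from the text above, but the bodies, the `list_comp` helper, and the sample data are my assumptions, not the original benchmarking code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": np.random.randint(0, 100, size=1_000),
    "B": np.random.randint(0, 100, size=1_000),
})


def list_comp(df):
    # List comprehension over the zipped columns
    return pd.Series([a + b for a, b in zip(df["A"], df["B"])], index=df.index)


def vec(df):
    # Plain pandas vectorization: stays within the pandas API
    return df["A"] + df["B"]


def vec_numpy(df):
    # "numpandas" style: drop down to the underlying NumPy arrays
    return pd.Series(df["A"].to_numpy() + df["B"].to_numpy(), index=df.index)


# All three approaches compute the same values
print((vec(df) == vec_numpy(df)).all())
print((vec(df) == list_comp(df)).all())
```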

I should mention, however, that it isn't always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.

My Personal Opinion ^*

Most of the analyses performed on the various alternatives to the iter family have been through the lens of performance. However, in most situations you will typically be working on a reasonably sized dataset (nothing beyond a few thousand or 100K rows) and performance will come second to simplicity/readability of the solution.

Here is my personal preference when selecting a method to use for a problem.

For the novice:

Vectorization (when possible); apply(); List Comprehensions; itertuples()/iteritems(); iterrows(); Cython

For the more experienced:

Vectorization (when possible); apply(); List Comprehensions; Cython; itertuples()/iteritems(); iterrows()

Vectorization prevails as the most idiomatic method for any problem that can be vectorized. Always seek to vectorize! When in doubt, consult the docs, or look on Stack Overflow for an existing question on your particular task.

I do tend to go on about how bad apply is in a lot of my posts, but I do concede it is easier for a beginner to wrap their head around what it's doing. Additionally, there are quite a few use cases for apply, as explained in this post of mine.

Cython ranks lower down on the list because it takes more time and effort to pull off correctly. You will usually never need to write code with pandas that demands this level of performance that even a list comprehension cannot satisfy.

_{* As with any personal opinion, please take with heaps of salt!}

Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?", showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

And finally ... a TLDR to summarize this post

Upvotes: 2510

Romain Capron

Reputation: 1785

How to iterate efficiently

If you really have to iterate over a Pandas DataFrame, you will probably want to avoid using iterrows(). There are different methods, and the usual iterrows() is far from being the best. `itertuples()` can be 100 times faster.

In short:

As a general rule, use df.itertuples(name=None). In particular, when you have a fixed number of columns and fewer than 255 columns. See bullet (3) below.
Otherwise, use df.itertuples(), except if your columns have special characters such as spaces or -. See bullet (2) below.
It is possible to use itertuples() even if your dataframe has strange columns, by using the last example below. See bullet (4) below.
Only use iterrows() if you cannot use any of the previous solutions. See bullet (1) below.

Different methods to iterate over rows in a Pandas `DataFrame`:

First, for use in all examples below, generate a random dataframe with a million rows and 4 columns, like this:

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)

The output of all of these examples is shown at the bottom.

The usual iterrows() is convenient, but damn slow:

start_time = time.perf_counter()
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))

Using the default named itertuples() is already much faster, but it doesn't work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot be simply converted to a Python variable name):
```
start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
```

Using nameless itertuples() by setting name=None is even faster, but not really convenient, as you have to define a variable per column.

start_time = time.perf_counter()
result = 0
for (_, col1, col2, col3, col4) in df.itertuples(name=None):
    result += max(col2, col3)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

Finally, using polyvalent itertuples() is slower than the previous example, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange.

start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
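
A possible variation on example 4, sketched under the assumption that the column positions stay fixed while looping: resolve the positions once with get_loc before the loop and index plain tuples. The small frame and its column names are made up for illustration, and this variant has not been timed against the examples above.

```python
import pandas as pd

# Hypothetical frame whose column names are not valid Python identifiers
df = pd.DataFrame({
    'My Col-Name is very Strange': [1, 5, 3],
    'other col': [4, 2, 6],
})

# Named tuples fall back to positional field names for such columns
print(next(df.itertuples(index=False))._fields)

# Resolve the column positions once, outside the loop
pos_a = df.columns.get_loc('My Col-Name is very Strange')
pos_b = df.columns.get_loc('other col')

result = 0
for row in df.itertuples(index=False, name=None):
    result += max(row[pos_a], row[pos_b])

print(result)
```

Because the rows are plain tuples here, the lookup by position works no matter what the column labels contain.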

Output of all code and examples above:

         A   B   C   D
0       41  63  42  23
1       54   9  24  65
2       15  34  10   9
3       39  94  82  97
4        4  88  79  54
...     ..  ..  ..  ..
999995  48  27   4  25
999996  16  51  34  28
999997   1  39  61  14
999998  66  51  27  70
999999  51  53  47  99

[1000000 rows x 4 columns]

1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519

Plot of these results from @Gabriel Staples in his answer here:

Sometimes loops really are better than vectorized code

As many answers here correctly point out, your default plan in Pandas should be to write vectorized code (with its implicit loops) rather than attempting an explicit loop yourself. But the question remains whether you should ever write loops in Pandas, and if so what's the best way to loop in those situations.

I believe there is at least one general situation where loops are appropriate: when you need to calculate some function that depends on values in other rows in a somewhat complex manner. In this case, the looping code is often simpler, more readable, and less error prone than vectorized code.

The looping code might even be faster too, as you'll see below, so loops might make sense in cases where speed is of utmost importance. But really, those are just going to be subsets of cases where you probably should have been working in numpy/numba (rather than Pandas) to begin with, because optimized numpy/numba will almost always be faster than Pandas.

Let's show this with an example. Suppose you want to take a cumulative sum of a column, but reset it whenever some other column equals zero:

import pandas as pd
import numpy as np

df = pd.DataFrame( { 'x':[1,2,3,4,5,6], 'y':[1,1,1,0,1,1]  } )

#   x  y  desired_result
#0  1  1               1
#1  2  1               3
#2  3  1               6
#3  4  0               4
#4  5  1               9
#5  6  1              15

This is a good example where you could certainly write one line of Pandas to achieve this, although it's not especially readable, especially if you aren't fairly experienced with Pandas already:

df.groupby( (df.y==0).cumsum() )['x'].cumsum()

That's going to be fast enough for most situations, although you could also write faster code by avoiding the groupby, but it will likely be even less readable.
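
As a rough sketch of what a groupby-free version might look like (one possible formulation, not benchmarked here): take the plain cumulative sum and subtract the running total that had accumulated just before the most recent reset row. The names running, reset and offset are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [1, 1, 1, 0, 1, 1]})

running = df['x'].cumsum()   # plain cumulative sum
reset = df['y'] == 0         # rows where the sum restarts

# Running total just before each reset row, carried forward to later rows
offset = (running - df['x']).where(reset).ffill().fillna(0)
no_groupby = (running - offset).astype('int64')

# The groupby one-liner, for comparison
with_groupby = df.groupby((df['y'] == 0).cumsum())['x'].cumsum()

print(no_groupby.tolist())
print(with_groupby.tolist())
```

Both versions give the same column on this data, but the intent of the second-to-last step is arguably harder to see at a glance than the groupby.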

Alternatively, what if we write this as a loop? You could do something like the following with NumPy:

import numba as nb

@nb.jit(nopython=True)  # Optional
def custom_sum(x,y):
    x_sum = x.copy()
    for i in range(1,len(x)):
        if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]
    return x_sum

df['desired_result'] = custom_sum( df.x.to_numpy(), df.y.to_numpy() )

Admittedly, there's a bit of overhead there required to convert DataFrame columns to NumPy arrays, but the core piece of code is just one line of code that you could read even if you didn't know anything about Pandas or NumPy:

if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]

And this code is actually faster than the vectorized code. In some quick tests with 100,000 rows, the above is about 10x faster than the groupby approach. Note that one key to the speed there is numba, which is optional. Without the "@nb.jit" line, the looping code is actually about 10x slower than the groupby approach.

Clearly this example is simple enough that you would likely prefer the one line of pandas to writing a loop with its associated overhead. However, there are more complex versions of this problem for which the readability or speed of the NumPy/numba loop approach likely makes sense.

Upvotes: 14

imz22

Reputation: 2938

Along with the great answers in this post, I am going to propose a Divide and Conquer approach. I am not writing this answer to replace the other great answers but to complement them with another approach that worked efficiently for me. It has two steps: splitting and merging the pandas dataframe:

PROS of Divide and Conquer:

You don't need to use vectorization or any other methods to cast the type of your dataframe into another type
You don't need to Cythonize your code which normally takes extra time from you
In my case, iterrows() and itertuples() had the same performance over the entire dataframe
Depending on your choice of slicing index, you will be able to speed up the iteration exponentially. The higher the index, the quicker your iteration process.

CONS of Divide and Conquer:

The iteration over one slice should not depend on a different slice of the same dataframe. If you want to read from or write to another slice, it may be difficult to do that.

=================== Divide and Conquer Approach =================

Step 1: Splitting/Slicing

In this step, we are going to divide the iteration over the entire dataframe. Suppose you read a CSV file into a pandas df and then iterate over it. In my case, I have 5,000,000 records and I am going to split them into slices of 100,000 records.

NOTE: As the runtime analyses in the other solutions on this page explain, the runtime of a search on the df grows disproportionately with the number of records. Based on the benchmark on my data, here are the results:

Number of records | Iteration rate [per second]
========================================
100,000           | 500
500,000           | 200
1,000,000         | 50
5,000,000         | 20

Step 2: Merging

This is going to be an easy step, just merge all the written CSV files into one dataframe and write it into a bigger CSV file.

Here is the sample code:

# Step 1 (Splitting/Slicing)
import pandas as pd
df_all = pd.read_csv('C:/KtV.csv')
df_index = 100000
df_len = len(df_all)
for i in range(df_len // df_index + 1):
    lower_bound = i * df_index
    higher_bound = min(lower_bound + df_index, df_len)
    # Splitting/slicing df (make sure to copy(), otherwise it will be a view)
    df = df_all[lower_bound:higher_bound].copy()
    '''
    Write your iteration over the sliced df here
    using iterrows() or itertuples() or ...
    '''
    # Writing into CSV files
    df.to_csv('C:/KtV_prep_' + str(i) + '.csv')



# Step 2 (Merging)
filename = 'C:/KtV_prep_'
df = (pd.read_csv(f) for f in [filename + str(i) + '.csv' for i in range(df_len // df_index + 1)])
df_prep_all = pd.concat(df)
df_prep_all.to_csv('C:/KtV_prep_all.csv')
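
A related sketch, not the code above: pandas can also do the splitting at read time through the chunksize parameter of read_csv, which yields one DataFrame per slice. The in-memory CSV, the chunk size of 4 and the total column are made up so the example runs on its own; the per-slice work is only a placeholder.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file on disk
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

processed = []
# With chunksize, read_csv yields DataFrames of at most that many rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Placeholder for the per-slice iteration
    chunk['total'] = [row.a + row.b for row in chunk.itertuples(index=False)]
    processed.append(chunk)

# Merge the processed slices back into one dataframe
df_prep_all = pd.concat(processed)
print(len(processed), df_prep_all.shape)
```

This keeps only one slice in memory while it is being processed, at the cost of the same restriction as above: the work on one slice cannot easily depend on another.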

Reference:

Efficient way of iteration over datafreame

Concatenate CSV files into one Pandas Dataframe

Upvotes: 2

Ernesto Elsäßer

Reputation: 1267

Probably the most elegant solution (but certainly not the most efficient):

for row in df.values:
    c2 = row[1]
    print(row)
    # ...

for c1, c2 in df.values:
    pass  # ...

Note that:

the documentation explicitly recommends using .to_numpy() instead
the produced NumPy array will have a dtype that fits all columns, in the worst case object
there are good reasons not to use a loop in the first place

Still, I think this option should be included here, as a straightforward solution to a (one should think) trivial problem.
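
To illustrate the dtype note above, here is a small sketch using .to_numpy(); the two frames are made-up examples.

```python
import pandas as pd

numeric = pd.DataFrame({'c1': [10, 11], 'c2': [100, 110]})
mixed = pd.DataFrame({'c1': [10, 11], 'c2': ['a', 'b']})

# All-integer columns share one integer dtype
print(numeric.to_numpy().dtype)

# Integers next to strings only fit into an object array
print(mixed.to_numpy().dtype)

# Row-wise unpacking works the same way as with df.values
for c1, c2 in mixed.to_numpy():
    print(c1, c2)
```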

Upvotes: 9

Sachin

Reputation: 1704

We have multiple options to do the same, and lots of folks have shared their answers.

I found the below two methods easy and efficient to do:

Example:

 import pandas as pd
 inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
 df = pd.DataFrame(inp)
 print (df)

 # With the iterrows method

 for index, row in df.iterrows():
     print(row["c1"], row["c2"])

 # With the itertuples method

 for row in df.itertuples(index=True, name='Pandas'):
     print(row.c1, row.c2)

Note: itertuples() is supposed to be faster than iterrows()

Upvotes: 48

gru

Reputation: 3079

Disclaimer: Although there are so many answers here which recommend not using an iterative (loop) approach (and I mostly agree), I would still see it as a reasonable approach for the following situation:

Extend a dataframe with data from an API

Let's say you have a large dataframe which contains incomplete user data. Now you have to extend this data with additional columns, for example, the user's age and gender.

Both values have to be fetched from a backend API. I'm assuming the API doesn't provide a "batch" endpoint (which would accept multiple user IDs at once). Otherwise, you should rather call the API only once.

The costs (waiting time) for the network request surpass the iteration of the dataframe by far. We're talking about network round trip times of hundreds of milliseconds compared to the negligibly small gains in using alternative approaches to iterations.

One expensive network request for each row

So in this case, I would absolutely prefer using an iterative approach. Although the network request is expensive, it is guaranteed to be triggered only once for each row in the dataframe. Here is an example using DataFrame.iterrows:

Example

for index, row in users_df.iterrows():
  user_id = row['user_id']

  # Trigger expensive network request once for each row
  response_dict = backend_api.get(f'/api/user-data/{user_id}')

  # Extend dataframe with multiple data from response
  users_df.at[index, 'age'] = response_dict.get('age')
  users_df.at[index, 'gender'] = response_dict.get('gender')
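
A possible alternative layout of the same idea, as a sketch only: collect the responses in a list and attach them as columns in one step. It still issues one request per row. fetch_user_data and its fake backend are hypothetical stand-ins for the backend_api.get call above, so the example can run without a network.

```python
import pandas as pd

def fetch_user_data(user_id):
    # Hypothetical stand-in for backend_api.get(f'/api/user-data/{user_id}')
    fake_backend = {
        1: {'age': 31, 'gender': 'f'},
        2: {'age': 45, 'gender': 'm'},
        3: {'age': 27, 'gender': 'f'},
    }
    return fake_backend[user_id]

users_df = pd.DataFrame({'user_id': [1, 2, 3]})

# Still exactly one (expensive) request per row
responses = [fetch_user_data(user_id) for user_id in users_df['user_id']]

# Build the new columns in one go, aligned on the original index
extra = pd.DataFrame(responses, index=users_df.index)
users_df = users_df.join(extra[['age', 'gender']])
print(users_df)
```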

Upvotes: 4

Ashvani Jaiswal

Reputation: 110

df.iterrows() returns tuple(a, b) where a is the index and b is the row.
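
A minimal sketch of what that tuple looks like, using a made-up two-column frame: the second element is a pandas Series whose labels are the column names, so its values can be read by column name.

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Take just the first (index, row) pair from the iterator
index, row = next(df.iterrows())

print(index)                  # the index label of the row
print(type(row).__name__)     # the row itself is a Series
print(row['c1'], row['c2'])   # access values by column name
```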

Upvotes: 6

PJay

Reputation: 2807

You can use the df.iloc function as follows:

for i in range(0, len(df)):
    print(df.iloc[i]['c1'], df.iloc[i]['c2'])

Upvotes: 174

tbrugere

Reputation: 1623

As the accepted answer states, the fastest way to apply a function over rows is to use a vectorized function, the so-called NumPy ufuncs (universal functions).

But what should you do when the function you want to apply isn't already implemented in NumPy?

Well, using the vectorize decorator from numba, you can easily create ufuncs directly in Python like this:

from numba import vectorize, float64

@vectorize([float64(float64)])
def f(x):
    # x is a single float64 element; do something with it, and return a float
    return x * 2.0  # placeholder computation

The documentation for this function is here: Creating NumPy universal functions

Upvotes: 2

bug_spray

Reputation: 1516

Update: cs95 has updated his answer to include plain numpy vectorization. You can simply refer to his answer.

cs95 shows that Pandas vectorization far outperforms other Pandas methods for computing stuff with dataframes.

I wanted to add that if you first convert the dataframe to a NumPy array and then use vectorization, it's even faster than Pandas dataframe vectorization (and that includes the time to turn it back into a dataframe series).

If you add the following functions to cs95's benchmark code, this becomes pretty evident:

def np_vectorization(df):
    np_arr = df.to_numpy()
    return pd.Series(np_arr[:,0] + np_arr[:,1], index=df.index)

def just_np_vectorization(df):
    np_arr = df.to_numpy()
    return np_arr[:,0] + np_arr[:,1]

Upvotes: 19

artoby

Reputation: 1940

In short

Use vectorization if possible
If an operation can't be vectorized - use list comprehensions
If you need a single object representing the entire row - use itertuples
If the above is too slow - try swifter.apply
If it's still too slow - try a Cython routine
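
As a small illustration of the second and third points, here is a sketch with a made-up frame and a made-up labelling rule; it is not part of the benchmark.

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# List comprehension over zipped columns, for logic that is awkward to vectorize
df['label'] = [
    f"{c1}-{c2}" if c1 % 2 == 0 else "odd"
    for c1, c2 in zip(df['c1'], df['c2'])
]

# itertuples when a single object per row is more convenient
for row in df.itertuples(index=False):
    print(row.c1, row.c2, row.label)
```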

Benchmark

Upvotes: 18

dna-data

Reputation: 93

Use df.iloc[]. For example, using dataframe 'rows_df':

To get values from a specific row, you can convert the dataframe into ndarray.

Then select the row and column values like this:

Upvotes: 0

François B.

Reputation: 1174

The easiest way, use the apply function

def print_row(row):
   print(row['c1'], row['c2'])

df.apply(print_row, axis=1)

Upvotes: 10

shubham ranjan

Reputation: 637

There are many ways to iterate over the rows in a Pandas dataframe. One very simple and intuitive way is:

df = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]})
print(df)
for i in range(df.shape[0]):
    # For printing the second column
    print(df.iloc[i, 1])

    # For printing more than one columns
    print(df.iloc[i, [0, 2]])

Upvotes: 11

Lucas B

Reputation: 2538

I was looking for How to iterate on rows and columns and ended here so:

for i, row in df.iterrows():
    for j, column in row.items():
        print(column)

Upvotes: 58

James L.

Reputation: 14535

You can also do NumPy indexing for even greater speed ups. It's not really iterating but works much better than iteration for certain applications.

subset = row['c1'][0:5]
all = row['c1'][:]

You may also want to cast it to an array. These indexes/selections are supposed to act like NumPy arrays already, but I ran into issues and needed to cast

np.asarray(all)
imgs[:] = cv2.resize(imgs[:], (224,224) ) # Resize every image in an hdf5 file

Upvotes: 7

Hossein Kalbasi

Reputation: 1861

For both viewing and modifying values, I would use iterrows(). In a for loop and by using tuple unpacking (see the example: i, row), I use the row for only viewing the value and use i with the loc method when I want to modify values. As stated in previous answers, here you should not modify something you are iterating over.

for i, row in df.iterrows():
    if row['A'] == 'Old_Value':
        df.loc[i, 'A'] = 'New_Value'

Here the row in the loop is a copy of that row, and not a view of it. Therefore, you should NOT write something like row['A'] = 'New_Value', it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.
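
A small sketch of the difference, using a made-up frame with mixed column types (in which the row handed out by iterrows() is a copy): writing to row leaves the DataFrame unchanged, while writing through loc changes it.

```python
import pandas as pd

df = pd.DataFrame({'A': ['Old_Value', 'keep'], 'B': [1, 2]})

# Writing to the row only changes the temporary Series
for i, row in df.iterrows():
    row['A'] = 'ignored'
after_row_write = df['A'].tolist()
print(after_row_write)

# Writing through loc changes the DataFrame itself
for i, row in df.iterrows():
    if row['A'] == 'Old_Value':
        df.loc[i, 'A'] = 'New_Value'
after_loc_write = df['A'].tolist()
print(after_loc_write)
```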

Upvotes: 13

Wes McKinney

Reputation: 105631

You should use df.iterrows(), though iterating row-by-row is not especially efficient since Series objects have to be created.
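
One visible consequence of building a Series per row, shown as a small sketch with made-up column names: a Series has a single dtype, so a row that mixes integers and floats comes back with the integers upcast to float. itertuples() is shown alongside for comparison.

```python
import pandas as pd

df = pd.DataFrame({'qty': [1, 2], 'price': [1.5, 2.5]})

# iterrows(): one Series per row, with one dtype for the whole row
_, row = next(df.iterrows())
print(row.dtype, row['qty'])

# itertuples(): one tuple per row, values keep their per-column types
tup = next(df.itertuples(index=False))
print(type(tup.qty).__name__, type(tup.price).__name__)
```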

Upvotes: 255

morganics

Reputation: 1249

Some libraries (e.g. a Java interop library that I use) require values to be passed in a row at a time, for example, if streaming data. To replicate the streaming nature, I 'stream' my dataframe values one by one. I wrote the class below, which comes in handy from time to time.

from typing import Dict, List

class DataFrameReader:
  def __init__(self, df):
    self._df = df
    self._row = None
    self._columns = df.columns.tolist()
    self.reset()
    self.row_index = 0

  def __getattr__(self, key):
    return self.__getitem__(key)

  def read(self) -> bool:
    self._row = next(self._iterator, None)
    self.row_index += 1
    return self._row is not None

  def columns(self):
    return self._columns

  def reset(self) -> None:
    self._iterator = self._df.itertuples()

  def get_index(self):
    return self._row[0]

  def index(self):
    return self._row[0]

  def to_dict(self, columns: List[str] = None):
    return self.row(columns=columns)

  def tolist(self, cols) -> List[object]:
    return [self.__getitem__(c) for c in cols]

  def row(self, columns: List[str] = None) -> Dict[str, object]:
    cols = set(self._columns if columns is None else columns)
    return {c : self.__getitem__(c) for c in self._columns if c in cols}

  def __getitem__(self, key) -> object:
    # the df index of the row is at index 0
    try:
        if type(key) is list:
            # look up each requested column and return the values as a list
            return [self._row[self._columns.index(k) + 1] for k in key]
        ix = self._columns.index(key) + 1
        return self._row[ix]
    except BaseException as e:
        return None

  def __next__(self) -> 'DataFrameReader':
    if self.read():
        return self
    else:
        raise StopIteration

  def __iter__(self) -> 'DataFrameReader':
    return self

Which can be used:

for row in DataFrameReader(df):
  print(row.my_column_name)
  print(row.to_dict())
  print(row['my_column_name'])
  print(row.tolist(['my_column_name']))

This preserves the value/name mapping for the rows being iterated. Obviously, it is a lot slower than using apply and Cython as indicated above, but it is necessary in some circumstances.

Upvotes: 3

Zeitgeist

Reputation: 1520

There is a way to iterate through rows while getting a DataFrame in return, and not a Series. I don't see anyone mentioning that you can pass the index as a list for the row to be returned as a DataFrame:

for i in range(len(df)):
    row = df.iloc[[i]]

Note the usage of double brackets. This returns a DataFrame with a single row.
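
A quick sketch of the difference between single and double brackets, using a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11], 'c2': [100, 110]})

as_series = df.iloc[0]     # single brackets: the row as a Series
as_frame = df.iloc[[0]]    # double brackets: a one-row DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```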

Upvotes: 17

Grag2015

Reputation: 671

 for ind in df.index:
     print(df['c1'][ind], df['c2'][ind])

Upvotes: 26

Zach

Reputation: 1153

Sometimes a useful pattern is:

# Borrowing @KutalmisB df example
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
# The to_dict call results in a list of dicts
# where each row_dict is a dictionary with k:v pairs of columns:value for that row
for row_dict in df.to_dict(orient='records'):
    print(row_dict)

Which results in:

{'col1':1.0, 'col2':0.1}
{'col1':2.0, 'col2':0.2}

Upvotes: 20
