How would one vectorize over a pandas dataframe column over a range of rows?

Question

So I have a Pandas DataFrame with x columns and y rows. The data in the DataFrame are float64 values. I'm trying to calculate the slope correlation between two columns, but only over a range of values in a single column (e.g. the column has 25000 rows and I only want values ranging from 5-10, which happen to be in rows 2000-4000). To do so, I was going to iterate in the way demonstrated by the following pseudocode:

# Assumes the default integer index (0, 1, 2, ...)
for i in range(len(df)):
    # Keep only rows where Column 1 is between 5 and 10, inclusive
    if 5.0 <= df.loc[i, 'Column 1'] <= 10.0:
        value = df.loc[i, 'Column 1'] / df.loc[i, 'Column 2']
        df.loc[i, 'New Column'] = value

Note: the above code is only meant as an outline of what I am trying to accomplish

I was looking at ways to iterate through Pandas DataFrames, and came across this link: How to iterate over rows in a Pandas DataFrame.

One of the answers refers to much better ways of manipulating data besides brute iteration: "Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting." Thus, I want to vectorize my approach so I can manipulate multiple rows at a time to drastically decrease my runtime.

I was looking through other questions, and most answers are somewhat helpful but I need help with the specifics for my particular problem. I think the bulk of what I am trying to accomplish can be summarized with the following list:

Given a Pandas DataFrame that contains multiple columns, iterate through a single column.
In the single column, iterate through a certain range of values (e.g. over the course of 10k rows where values increase from 1 to 100 from 1st row to 10kth row, only iterate over values 20-50).

Sorry in advance for the repetitive nature of my question, I'm just really struggling with this particular problem in trying to create efficient iteration code.

Poe Dator · Accepted Answer

Bob,

Just use loc to select rows with conditions and then enter formula with column references:

df.loc[(df['Column 1'] <= 10.0) & (df['Column 1'] >= 5.0), 'New Column'] = df['Column 1'] / df['Column 2']

In your case, between is more elegant:

df.loc[df['Column 1'].between(5, 10), 'New Column'] = df['Column 1'] / df['Column 2']

Anyhow, direct math operations are orders of magnitude faster than iterations. Behold the power of Pandas! :)
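
For reference, here is a minimal self-contained sketch of the masked assignment described above. The sample values are made up for illustration and the variable names mask and mask_explicit are hypothetical; only the column names come from the question. It assumes that between includes both endpoints by default, which holds for the pandas versions I am aware of.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "Column 1": [1.0, 5.0, 7.5, 10.0, 12.0],
    "Column 2": [2.0, 2.0, 3.0, 4.0, 6.0],
})

# Boolean mask: True where Column 1 lies in [5, 10], both ends included.
mask = df["Column 1"].between(5, 10)

# Same mask written with explicit comparisons; note the parentheses around
# each comparison, which are required because & binds tighter than <= and >=.
mask_explicit = (df["Column 1"] >= 5.0) & (df["Column 1"] <= 10.0)

# Assign the ratio on the masked rows only. The right-hand side is aligned
# on the index, so rows outside the mask are left as NaN in the new column.
df.loc[mask, "New Column"] = df["Column 1"] / df["Column 2"]

print(df)
```

Rows that fall outside the range end up as NaN in 'New Column', so if only the selected rows are needed, df[mask] or df.dropna(subset=['New Column']) returns just those.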
