Reputation: 67
So I have a Pandas DataFrame with x columns that have y rows. The data in the DataFrame are float64 values. I'm trying to calculate the slope correlation between two columns, but only over a range of values in a single column (e.g. the column has 25000 rows, and I only want values ranging from 5-10, which happen to be in rows 2000-4000). In order to do so, I was going to iterate in the way demonstrated by the following pseudocode:
    for i in range(len(df['Column 1'])):
        # keep only rows where Column 1 lies between 5 and 10
        if 5.0 <= df.loc[i, 'Column 1'] <= 10.0:
            value = df.loc[i, 'Column 1'] / df.loc[i, 'Column 2']
            df.loc[i, 'New Column'] = value
Note: the above code is only an outline of what I am trying to accomplish, not something I intend to use as is.
I was looking at ways to iterate through Pandas DataFrames, and came across this link: How to iterate over rows in a Pandas DataFrame.
One of the answers refers to much better ways of manipulating data besides brute iteration: "Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting." Thus, I want to vectorize my approach so I can manipulate multiple rows at a time to drastically decrease my runtime.
I was looking through other questions, and most answers are somewhat helpful but I need help with the specifics for my particular problem. I think the bulk of what I am trying to accomplish can be summarized with the following list:
Sorry in advance for the repetitive nature of my question; I'm just really struggling with this particular problem in trying to create efficient iteration code.
Upvotes: 0
Views: 1059
Reputation: 4913
Bob,
Just use loc to select the rows that meet your condition, then assign the formula using column references:
df.loc[(df['Column 1'] <= 10.0) & (df['Column 1'] >= 5.0), 'New Column'] = df['Column 1'] / df['Column 2']
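If it helps to see that line in context, here is a minimal, self-contained sketch. The column names come from the question, but the five rows of data are made up for illustration. It also shows what happens to the rows outside the range: they are left as NaN in the new column.

```python
import pandas as pd

# Toy data; the column names follow the question, the values are invented.
df = pd.DataFrame({
    "Column 1": [2.0, 5.0, 7.5, 10.0, 12.0],
    "Column 2": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Boolean mask: True where Column 1 lies in the closed range [5, 10].
mask = (df["Column 1"] >= 5.0) & (df["Column 1"] <= 10.0)

# Assign the ratio only on the masked rows. pandas aligns the right-hand
# side on the index, so rows outside the mask stay NaN in the new column.
df.loc[mask, "New Column"] = df["Column 1"] / df["Column 2"]

print(df)
```

The parentheses around each comparison matter: & binds more tightly than <= and >=, so leaving them out changes what gets compared.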
In your case, between is more elegant (both bounds are inclusive by default):
df.loc[df['Column 1'].between(5, 10), 'New Column'] = df['Column 1'] / df['Column 2']
Anyhow, direct math operations are orders of magnitude faster than iterations. Behold the power of Pandas! :)
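If you want to check the speed difference on your own machine, something like the rough sketch below should do. The data is random (25,000 rows, to match the question), the helper names with_loop and vectorized are just made up for this example, and the vectorized version uses Series.where instead of a loc assignment, which is a slightly different way to get the same NaN-outside-the-range result. Timings will vary from machine to machine, so treat the numbers as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 25,000 rows as in the question; the values are random.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Column 1": rng.uniform(0.0, 20.0, size=25_000),
    "Column 2": rng.uniform(1.0, 2.0, size=25_000),
})


def with_loop(frame):
    # Row-by-row version, close to the outline in the question.
    out = pd.Series(np.nan, index=frame.index)
    for i in frame.index:
        x = frame.at[i, "Column 1"]
        if 5.0 <= x <= 10.0:
            out.at[i] = x / frame.at[i, "Column 2"]
    return out


def vectorized(frame):
    # Whole-column version: compute the ratio for every row at once,
    # then keep it only where Column 1 lies in [5, 10] (NaN elsewhere).
    ratio = frame["Column 1"] / frame["Column 2"]
    return ratio.where(frame["Column 1"].between(5, 10))


# Both versions should give the same values, with NaN outside the range.
same = with_loop(df).equals(vectorized(df))

loop_s = timeit.timeit(lambda: with_loop(df), number=1)
vec_s = timeit.timeit(lambda: vectorized(df), number=1)
print(f"same result: {same}, loop: {loop_s:.4f}s, vectorized: {vec_s:.4f}s")
```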
Upvotes: 2