Reputation: 135
How do I write the nested for
loop to access every other row with respect to a row within a pandas.dataframe
?
I am trying to perform some operations between rows in a pandas.dataframe
The operation for my example code is calculating Euclidean distances between each row with each other row.
The results are then saved into a some list in the form
[(row_reference, name, dist)]
.
I understand how to access each row in a pandas.dataframe using df.itterrows()
but I'm not sure how to access every other row with respect to the current row in order to perform the inter-row operation.
import pandas as pd
import numpy
import math
df = pd.DataFrame([{'name': "Bill", 'c1': 3, 'c2': 8}, {'name': "James", 'c1': 4, 'c2': 12},
{'name': "John", 'c1': 12, 'c2': 26}])
#Euclidean distance function where x1=c1_row1 ,x2=c1_row2, y1=c2_row1, #y2=c2_row2
def edist(x1, x2, y1, y2):
dist = math.sqrt(math.pow((x1 - x2),2) + math.pow((y1 - y2),2))
return dist
# Calculate Euclidean distance for one row (e.g. Bill) against each other row
# (e.g. "James" and "John"). Save results to a list (N_name, dist).
all_results = []
for index, row in df.iterrows():
results = []
# secondary loop to look for OTHER rows with respect to the current row
# results.append(row2['name'],edist())
all_results.append(row,results)
I hope to perform some operation edist()
on all rows with respect to the current row/index.
I expect the loop to do the following:
In[1]:
result = []
result.append(['James',edist(3,4,8,12)])
result.append(['John',edist(3,12,8,26)])
results_all=[]
results_all.append([0,result])
result2 = []
result2.append(['John',edist(4,12,12,26)])
result2.append(['Bill',edist(4,3,12,8)])
results_all.append([1,result2])
result3 = []
result3.append(['Bill',edist(12,3,26,8)])
result3.append(['James', edist(12,4,26,12)])
results_all.append([2,result3])
results_all
With the following expected resulting output:
OUT[1]:
[[0, [['James', 4.123105625617661], ['John', 20.12461179749811]]],
[1, [['John', 16.1245154965971], ['Bill', 4.123105625617661]]],
[2, [['Bill', 20.12461179749811], ['James', 16.1245154965971]]]]
Upvotes: 2
Views: 1008
Reputation: 107587
Consider shift
and avoid any rowwise looping. And because you run straightforward arithmetic, run the expression directly on columns using help of numpy
for vectorized calculation.
import numpy as np
df = (df.assign(c1_shift = lambda x: x['c1'].shift(1),
c2_shift = lambda x: x['c2'].shift(1))
)
df['dist'] = np.sqrt(np.power(df['c1'] - df['c1_shift'], 2) +
np.power(df['c2'] - df['c2_shift'], 2))
print(df)
# name c1 c2 c1_shift c2_shift dist
# 0 Bill 3 8 NaN NaN NaN
# 1 James 4 12 3.0 8.0 4.123106
# 2 John 12 26 4.0 12.0 16.124515
Should you want every row combination with each other, consider a cross join on itself and query out reverse duplicates:
df = (pd.merge(df.assign(key=1), df.assign(key=1), on="key")
.query("name_x < name_y")
.drop(columns=['key'])
)
df['dist'] = np.sqrt(np.power(df['c1_x'] - df['c1_y'], 2) +
np.power(df['c2_x'] - df['c2_y'], 2))
print(df)
# name_x c1_x c2_x name_y c1_y c2_y dist
# 1 Bill 3 8 James 4 12 4.123106
# 2 Bill 3 8 John 12 26 20.124612
# 5 James 4 12 John 12 26 16.124515
Upvotes: 1
Reputation: 150735
If you data is not too long, you can check out scipy's distance_matrix
:
all_results = pd.DataFrame(distance_matrix(df[['c1','c2']],df[['c1','c2']]),
index=df['name'],
columns=df['name'])
Output:
name Bill James John
name
Bill 0.000000 4.123106 20.124612
James 4.123106 0.000000 16.124515
John 20.124612 16.124515 0.000000
Upvotes: 1