Reputation: 967
I have a huge dataframe
. The structure data looks like this:
df
ID Annotation X Y
A Boarding 767513.9918 9425956.2571
A Alighting 767154.1396 9427584.0004
B Boarding 767450.5277 9432627.9543
B Alighting 767495.0101 9426797.1772
C Boarding 767648.9507 9426442.5497
C Alighting 767037.0309 9428878.9032
........
The X and Y data used UTM coordinates. I want to calculate the distance between each ID boarding and alighting. My question is quietly similar but different to this question (Distance matrix in Python Pandas). My expected result looks like this:
result
ID Anotation_1 X_1 Y_1 Anotation_2 X_2 Y_2 Dist
A Boarding 767513.99 9425956.26 Alighting 767154.14 9427584.00 1667.05
B Boarding 767450.53 9432627.95 Alighting 767495.01 9426797.18 5830.95
C Boarding 767648.95 9426442.55 Alighting 767037.03 9428878.90 2512.02
.......
Thank you for your help.
Upvotes: 0
Views: 932
Reputation: 75120
I am using unstack()
:
m=(df.assign(k=(df.groupby('ID').cumcount()+1).astype(str)).
set_index(['ID','k']).unstack().sort_values(by='k',axis=1))
m.columns=m.columns.map('_'.join)
m=m.assign(Dist=np.sqrt((m.X_1 - m.X_2)**2 + (m.Y_1 - m.Y_2)**2))
print(m)
Upvotes: 2
Reputation: 149125
I would pivot the dataframe:
result = df.pivot('ID', 'Annotation', ['X', 'Y'])
to get
X Y
Annotation Alighting Boarding Alighting Boarding
ID
A 767154.1396 767513.9918 9.427584e+06 9.425956e+06
B 767495.0101 767450.5277 9.426797e+06 9.432628e+06
C 767037.0309 767648.9507 9.428879e+06 9.426443e+06
Then I would rename the columns and reindex:
ix = result.columns.to_frame()
result.columns = ix['Annotation'] + '_' + ix.iloc[:,0]
result = result.reindex(columns=['Alighting_X', 'Alighting_Y', 'Boarding_X', 'Boarding_Y'])
to get:
Alighting_X Alighting_Y Boarding_X Boarding_Y
ID
A 767154.1396 9.427584e+06 767513.9918 9.425956e+06
B 767495.0101 9.426797e+06 767450.5277 9.432628e+06
C 767037.0309 9.428879e+06 767648.9507 9.426443e+06
It is now easy to compute the distance:
result['Dist'] = np.sqrt((result.Alighting_X - result.Boarding_X)**2 + (result.Alighting_Y - result.Boarding_Y)**2)
to finally get:
Alighting_X Boarding_X Alighting_Y Boarding_Y Dist
ID
A 767154.1396 767513.9918 9.427584e+06 9.425956e+06 1667.045847
B 767495.0101 767450.5277 9.426797e+06 9.432628e+06 5830.946773
C 767037.0309 767648.9507 9.428879e+06 9.426443e+06 2512.023929
Upvotes: 2
Reputation: 839
One way to approach this, assuming the input is clean and correct, would be using groupby
:
df = df.groupby('ID').apply(lambda x: pd.Series(x.values[0:2,2:4].flatten())) # (*)
df.columns=['X_1','Y_1','X_2','Y_2']
#df.reset_index() # Uncomment if you want 'ID' as a column and not an Index
As for the other columns in your desired result: The Anotation_1
and Anotation_2
are always constant, so I didn't bother to include them. The Dist
column -- well, you can calculate it now, given the new columns, or you could change the code above to calculate the distance already while traversing the numbers in step (*)
above, thereby changing our code to something like: (here used dummy distance calculation, replace it with yours!)
def my_func(pdf):
return pd.Series([pdf.values[0,2], pdf.values[0,3], pdf.values[1,2], pdf.values[1,3],
np.sqrt((pdf.values[0,2]-pdf.values[1,2])**2+(pdf.values[0,3]-pdf.values[1,3])**2) # <= your distance calculation goes here...
])
df = df.groupby('ID').apply(my_func)
df.columns=['X_1','Y_1','X_2','Y_2','Dist']
#df.reset_index() # Uncomment if you want 'ID' as a column and not an Index
Update: If you insist on including those constant columns, you can simply add them later like so: (but why would you? especially if it's a large DataFrame
...)
df['Annotation_1'] = 'Boarding'
df['Annotation_2'] = 'Alighting'
# And if you further insist on a specific ordering of the columns, you can go with:
df = df[['Annotation_1', 'X_1', 'Y_1', 'Annotation_2', 'X_2', 'Y_2', 'Dist']]
Upvotes: 1