Reputation: 2832
Variance of location details
Location has latitude and longitude. I am looking for a single value that captures the variance of the location details (not a separate variance for latitude and for longitude). What is the best way to achieve that?
>>> pdf = pd.DataFrame({'latitude': {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
'longitude': {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0}
})
>>> pdf
latitude longitude
0 47.0 29.0
8 54.0 10.0
14 55.0 36.0
15 39.0 -9.0
2 31.0 121.0
As per the numpy documentation, np.var either flattens the array and computes one variance over all values, or computes the variance per column:
axis: None or int or tuple of ints, optional. Axis or axes along which the variance is computed. The default is to compute the variance of the flattened array.
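To make the two behaviours concrete, here is a minimal sketch using the same five points (the array name coords is only for illustration). Neither result is the combined measure being asked for: the flattened variance mixes latitude and longitude values as if they were the same quantity.

```python
import numpy as np

# Same five points as above: column 0 is latitude, column 1 is longitude.
coords = np.array([
    [47.0, 29.0],
    [54.0, 10.0],
    [55.0, 36.0],
    [39.0, -9.0],
    [31.0, 121.0],
])

# Default (axis=None): all ten numbers are treated as one flat sample.
flat_var = np.var(coords)

# axis=0: one population variance (ddof=0) per column.
per_column_var = np.var(coords, axis=0)

print(flat_var)
print(per_column_var)
```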
Expected (just an example)
>>> variance(pdf)
27.9
I would like to understand if the coordinates are close to each other. What is the best possible approach to get a "combined variance"?
Upvotes: 1
Views: 938
Reputation: 350
A single spread measure: convert lat/long to Cartesian coordinates (from a recipe) and take the mean distance of the points from their centroid.
import pandas as pd
import numpy as np
pdf = pd.DataFrame(
{
"latitude": {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
"longitude": {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0},
}
)
# Lat long is here interpreted as points on a sphere.
# We want to find average distance between all the points and the center of the points.
# To do this we project the spherical coordinates to cartesian coordinates.
def get_cartesian(latlon):
    lat, lon = latlon
    lat, lon = np.deg2rad(lat), np.deg2rad(lon)
    R = 6371  # radius of the earth in km
    x = R * np.cos(lat) * np.cos(lon)
    y = R * np.cos(lat) * np.sin(lon)
    z = R * np.sin(lat)
    return [x, y, z]


def dist_to_center(coords, center):
    # Euclidean (straight-line) distance between two 3-D points
    return np.linalg.norm(np.array(coords) - np.array(center))


pdf = pdf.assign(
    latlong=pd.Series(list(zip(pdf.latitude.values, pdf.longitude.values)), index=pdf.index),
    cartesian=lambda x: x["latlong"].apply(get_cartesian),
    # split out cartesian coordinates
    x=lambda c: c["cartesian"].apply(lambda x: x[0]),
    y=lambda c: c["cartesian"].apply(lambda x: x[1]),
    z=lambda c: c["cartesian"].apply(lambda x: x[2]),
    # calculate center point
    center_x=lambda cn: cn["x"].mean(),
    center_y=lambda cn: cn["y"].mean(),
    center_z=lambda cn: cn["z"].mean(),
    # row.tolist() avoids positional indexing (row[0]) on a label-indexed Series
    center_coord=lambda x: x[["center_x", "center_y", "center_z"]].apply(lambda row: row.tolist(), axis=1),
    # calculate the individual points' distance from the center point
    variance_from_center=lambda x: x.apply(lambda row: dist_to_center(row["cartesian"], row["center_coord"]), axis=1),
)
# get single mean for all the points' distance from the center defined by the points' mean position
variance = pdf["variance_from_center"].mean()
Result (in km, since R is given in km):
2754.22
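The same computation can also be written without the intermediate DataFrame columns. The following is only a sketch of the approach above, using plain NumPy arrays; the names mean_distance_to_centroid and EARTH_RADIUS_KM are my own and not from any library. Note that the distances are straight-line (chord) distances through the sphere, as in the code above, not great-circle distances.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # same radius as R in the answer above


def mean_distance_to_centroid(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Mean straight-line distance from each point to the 3-D centroid."""
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    lon = np.deg2rad(np.asarray(lon_deg, dtype=float))
    # Project every (lat, lon) pair onto a sphere: one row of (x, y, z) per point.
    xyz = radius * np.column_stack(
        (np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat))
    )
    centroid = xyz.mean(axis=0)
    # Distance of each point from the centroid, then a single mean value.
    return float(np.linalg.norm(xyz - centroid, axis=1).mean())


spread = mean_distance_to_centroid(
    [47.0, 54.0, 55.0, 39.0, 31.0],
    [29.0, 10.0, 36.0, -9.0, 121.0],
)
print(spread)
```

A smaller value means the points sit closer to their common centroid; identical points give 0.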
Upvotes: 1
Reputation: 5637
If I understood you correctly, you're looking for a score that describes how close together a group of coordinates is: the higher the score, the further apart the coordinates are spread.
You could create a new feature by multiplying the mean-centered longitude and latitude (scaled down by 100), then use the variance of this new feature as the score to compare different groups of coordinates. Let me illustrate with an example:
import matplotlib.pyplot as plt
import pandas as pd
#these points are closer together
df1 = pd.DataFrame({'latitude': {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
'longitude': {0: 54.0, 8: 55.0, 14: 39.0, 15: 31.0, 2: 47.0} })
df1['new'] = (df1['latitude']-df1['latitude'].mean()).mul(df1['longitude']-df1['longitude'].mean()).div(100)
score = df1['new'].var()
df1.plot(kind='scatter', x='longitude', y='latitude')
Output score 0.4407372
#these points have the same spread, but at a different location
df2 = pd.DataFrame({'latitude': {0: 147.0, 8: 154.0, 14: 155.0, 15: 139.0, 2: 131.0},
'longitude': {0: 154.0, 8: 155.0, 14: 139.0, 15: 131.0, 2: 147.0} })
df2['new'] = (df2['latitude']-df2['latitude'].mean()).mul(df2['longitude']-df2['longitude'].mean()).div(100)
score = df2['new'].var()
df2.plot(kind='scatter', x='longitude', y='latitude')
Output score 0.4407372
#these points are further apart
df3 = pd.DataFrame({'latitude': {0: 14.0, 8: 15.0, 14: 155.0, 15: 13.0, 2: 131.0},
'longitude': {0: 15.0, 8: 215.0, 14: 39.0, 15: 131.0, 2: 147.0} })
df3['new'] = (df3['latitude']-df3['latitude'].mean()).mul(df3['longitude']-df3['longitude'].mean()).div(100)
score = df3['new'].var()
df3.plot(kind='scatter', x='longitude', y='latitude')
Output score 2332.5498432
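The three repeated blocks can be folded into one helper. This is just a sketch of the score described above; spread_score is a name I made up, and the DataFrames reuse the first example's values with a plain default index.

```python
import pandas as pd


def spread_score(df):
    """Variance of the product of mean-centered latitude and longitude, divided by 100."""
    lat_centered = df["latitude"] - df["latitude"].mean()
    lon_centered = df["longitude"] - df["longitude"].mean()
    # Series.var() uses the sample variance (ddof=1), as in the examples above.
    return (lat_centered * lon_centered / 100).var()


tight = pd.DataFrame({
    "latitude": [47.0, 54.0, 55.0, 39.0, 31.0],
    "longitude": [54.0, 55.0, 39.0, 31.0, 47.0],
})
# Shifting every coordinate by the same amount leaves the centered values unchanged.
shifted = tight + 100

print(spread_score(tight))
print(spread_score(shifted))
```

One thing to keep in mind with this score: if all points share the same latitude (or the same longitude), the centered product is zero everywhere, so the score is 0 even when the points are far apart along the other axis.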
Upvotes: 1