s510

Reputation: 2832

Calculation of variance of Geo coordinates

How to calculate the variance of location details

Location has latitude and longitude. I am looking for a single value that will capture the variance of the location details (not separate variance for latitude and longitude). What is the best way to achieve that?

>>> pdf = pd.DataFrame({'latitude': {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
              'longitude': {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0}
             })

>>> pdf
    latitude  longitude
0       47.0       29.0
8       54.0       10.0
14      55.0       36.0
15      39.0       -9.0
2       31.0      121.0

According to the NumPy documentation, np.var either flattens the array and computes a single variance, or computes the variance along a given axis (for example, per column).

axis : None or int or tuple of ints, optional. Axis or axes along which the variance is computed. The default is to compute the variance of the flattened array.
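To make the two behaviours concrete, here is a small sketch using the same five points (the array name `coords` is only for illustration). Neither result is the single combined number I am after.

```python
import numpy as np

# Same five points as above: column 0 is latitude, column 1 is longitude.
coords = np.array([
    [47.0, 29.0],
    [54.0, 10.0],
    [55.0, 36.0],
    [39.0, -9.0],
    [31.0, 121.0],
])

# Default (axis=None): all ten numbers are pooled into one flat array first,
# so latitudes and longitudes are mixed together.
flat_var = np.var(coords)

# axis=0: one variance per column, i.e. latitude and longitude separately.
per_column_var = np.var(coords, axis=0)

print(flat_var)
print(per_column_var)
```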

Expected (just an example)

>>> variance(pdf)
27.9

I would like to understand if the coordinates are close to each other. What is the best possible approach to get a "combined variance"?

Upvotes: 1

Views: 938

Answers (2)

ivanp

Reputation: 350

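A single spread measure, obtained by converting latitude/longitude to Cartesian coordinates (adapted from a recipe) and averaging each point's distance from the center of the points.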

import pandas as pd
import numpy as np

pdf = pd.DataFrame(
    {
        "latitude": {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
        "longitude": {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0},
    }
)

# Latitude/longitude pairs are interpreted here as points on a sphere.
# We want the average distance between each point and the center (mean position) of the points.
# To do this we convert the spherical coordinates to Cartesian coordinates.
def get_cartesian(latlon):
    lat, lon = latlon
    lat, lon = np.deg2rad(lat), np.deg2rad(lon)
    R = 6371  # mean radius of the Earth in km
    x = R * np.cos(lat) * np.cos(lon)
    y = R * np.cos(lat) * np.sin(lon)
    z = R * np.sin(lat)

    return [x, y, z]


def dist_to_center(coords, center):
    return np.linalg.norm(np.array(coords) - np.array(center))


pdf = pdf.assign(
    latlong=pd.Series([x for x in zip(pdf.latitude.values, pdf.longitude.values)], index=pdf.index),
    cartesian=lambda x: x["latlong"].apply(get_cartesian),
    # split out cartesian coordinates
    x=lambda c: c["cartesian"].apply(lambda x: x[0]),
    y=lambda c: c["cartesian"].apply(lambda x: x[1]),
    z=lambda c: c["cartesian"].apply(
        lambda x: x[2],
    ),
    # calculate center point
    center_x=lambda cn: cn["x"].mean(),
    center_y=lambda cn: cn["y"].mean(),
    center_z=lambda cn: cn["z"].mean(),
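    # use .tolist() rather than x[0], x[1], x[2]: positional indexing on a label-indexed Series is deprecated
    center_coord=lambda x: x[["center_x", "center_y", "center_z"]].apply(lambda row: row.tolist(), axis=1),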
    # calculate the individual points' distance from the center point
    variance_from_center=lambda x: x.apply(lambda x: dist_to_center(x["cartesian"], x["center_coord"]), axis=1),
)

# Single score: the mean distance (in km) of the points from the center defined by their mean position.
# Note that this is a mean distance, not a variance in the statistical sense.
variance = pdf["variance_from_center"].mean()

Result (in km):

2754.22
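The same idea can also be written without the intermediate DataFrame columns. The following is only a sketch under the same assumptions as the code above (spherical Earth, radius 6371 km). The names `to_cartesian` and `spread_about_centroid` are made up for this example. It also returns the mean squared distance from the center, which is closer to a variance in the usual sense.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # same mean radius as R in the code above


def to_cartesian(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Convert latitude/longitude in degrees to an (n, 3) array of x, y, z in km."""
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    lon = np.deg2rad(np.asarray(lon_deg, dtype=float))
    return np.column_stack([
        radius * np.cos(lat) * np.cos(lon),
        radius * np.cos(lat) * np.sin(lon),
        radius * np.sin(lat),
    ])


def spread_about_centroid(lat_deg, lon_deg):
    """Return (mean distance, mean squared distance) of the points from their centroid."""
    xyz = to_cartesian(lat_deg, lon_deg)
    centroid = xyz.mean(axis=0)  # mean position; lies inside the sphere
    dists = np.linalg.norm(xyz - centroid, axis=1)  # straight-line distances in km
    return dists.mean(), (dists ** 2).mean()


lat = [47.0, 54.0, 55.0, 39.0, 31.0]
lon = [29.0, 10.0, 36.0, -9.0, 121.0]
mean_dist_km, mean_sq_dist_km2 = spread_about_centroid(lat, lon)
print(mean_dist_km, mean_sq_dist_km2)
```

The mean squared distance equals the sum of the population variances of x, y and z, so it can be read as a "total variance" of the point cloud in km². Because the center is the mean of the Cartesian points, the distances are straight lines through the Earth rather than paths along its surface.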

Upvotes: 1

blackraven

Reputation: 5637

If I understood you correctly, you're looking for a score that describes how close together a group of coordinates is. The higher the score, the further apart the coordinates are spread.

You could create a new feature by multiplying the mean-centered longitude and latitude, then use the variance of this new feature as the score to compare different groups of coordinates. Let me illustrate with an example:

import matplotlib.pyplot as plt
import pandas as pd

# these points are closer together
df1 = pd.DataFrame({'latitude': {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
                   'longitude': {0: 54.0, 8: 55.0, 14: 39.0, 15: 31.0, 2: 47.0} })
df1['new'] = (df1['latitude']-df1['latitude'].mean()).mul(df1['longitude']-df1['longitude'].mean()).div(100)
score = df1['new'].var()
df1.plot(kind='scatter', x='longitude', y='latitude')

Output score 0.4407372


# these points have the same spread, but at a different location
df2 = pd.DataFrame({'latitude': {0: 147.0, 8: 154.0, 14: 155.0, 15: 139.0, 2: 131.0},
                   'longitude': {0: 154.0, 8: 155.0, 14: 139.0, 15: 131.0, 2: 147.0} })
df2['new'] = (df2['latitude']-df2['latitude'].mean()).mul(df2['longitude']-df2['longitude'].mean()).div(100)
score = df2['new'].var()
df2.plot(kind='scatter', x='longitude', y='latitude')

Output score 0.4407372


# these points are further apart
df3 = pd.DataFrame({'latitude': {0: 14.0, 8: 15.0, 14: 155.0, 15: 13.0, 2: 131.0},
                   'longitude': {0: 15.0, 8: 215.0, 14: 39.0, 15: 131.0, 2: 147.0} })
df3['new'] = (df3['latitude']-df3['latitude'].mean()).mul(df3['longitude']-df3['longitude'].mean()).div(100)
score = df3['new'].var()
df3.plot(kind='scatter', x='longitude', y='latitude')

Output score 2332.5498432
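Since the three cases repeat the same computation, it could be wrapped in a small helper. This is a minimal sketch. The function name `spread_score` and the frame names are made up for this example, and the data is the same as above.

```python
import pandas as pd


def spread_score(df):
    """Variance of the centered latitude * longitude product, scaled by 1/100 as above."""
    lat_dev = df["latitude"] - df["latitude"].mean()
    lon_dev = df["longitude"] - df["longitude"].mean()
    return lat_dev.mul(lon_dev).div(100).var()


close = pd.DataFrame({"latitude": [47.0, 54.0, 55.0, 39.0, 31.0],
                      "longitude": [54.0, 55.0, 39.0, 31.0, 47.0]})
shifted = close + 100  # same shape, moved to a different location
far = pd.DataFrame({"latitude": [14.0, 15.0, 155.0, 13.0, 131.0],
                    "longitude": [15.0, 215.0, 39.0, 131.0, 147.0]})

for name, frame in [("close", close), ("shifted", shifted), ("far", far)]:
    print(name, spread_score(frame))
```

One property to keep in mind: if all points share the same latitude (or the same longitude), the product is zero for every point, so the score is 0 no matter how far apart the points are.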


Upvotes: 1
