Arash Howaida

Reputation: 2617

Python Geo-Spatial Coordinate Format Conversion

I have a dataframe containing 6 columns of coordinate pairs:

Degrees|Minutes|Seconds

(for both latitude and longitude). This is known as the NAD83 format.

I want to convert these into a new dataframe of only 2 columns in decimal format, known as NAD27.

The library I typically use, geopy, supports virtually every format, so there actually isn't a dedicated conversion function. I went through the documentation here to be sure: https://geopy.readthedocs.io/en/1.10.0/

Does python have any other means to convert to NAD27?

Upvotes: 2

Views: 8429

Answers (2)

Martin Valgur

Reputation: 6322

Let's suppose your DataFrame df contains the columns lonD, lonM, lonS, latD, latM and latS. Then the following should work, using geopandas and shapely, with pyproj doing the reprojection internally.

import geopandas as gpd
import numpy as np
from shapely.geometry import Point

def dms_to_dec(d, m, s):
    # sign is -1 when d carries a negative sign bit (including -0.0), else 1
    sign = 1 - 2 * np.signbit(d)
    return d + sign * m / 60 + sign * s / 3600

points = df.apply(lambda row: Point(dms_to_dec(*row[['lonD', 'lonM', 'lonS']]),
                                    dms_to_dec(*row[['latD', 'latM', 'latS']])),
                  axis=1)
gdf_nad83 = gpd.GeoDataFrame(df, geometry=points, crs='EPSG:4269')
gdf_nad27 = gdf_nad83.to_crs('EPSG:4267')
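For example, with a hypothetical single-row DataFrame (the sample values are made up purely for illustration):

import pandas as pd

# 71°3'42"W, 42°21'30"N; the sign lives on the D columns only
df = pd.DataFrame({
    'lonD': [-71.0], 'lonM': [3.0], 'lonS': [42.0],
    'latD': [42.0], 'latM': [21.0], 'latS': [30.0],
})

Running the snippet above then leaves the reprojected point in gdf_nad27.geometry.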

Upvotes: 4

Martijn Pieters

Reputation: 1124858

Because I ran into this too, and found the df.apply() approach too slow, I switched to using a MultiPoint() object and vectorised operations, then turned that single object into Point()s with list().

Moreover, we need to account for the fact that the DMS columns may only have included the - sign on the D column. If that's the case and you are lucky, the DataFrame was created using numpy floats and "-0.0" has been stored as numpy.NZERO (negative zero), in which case we can still recover the sign using numpy.signbit(). If not, the sign might be lost, and points just south of the equator or just west of the prime meridian will appear as just north or east of them instead.

Just to be explicit: the D, M, S coordinate notation is just that, a different way to write latitude and longitude coordinates, where D, M and S stand for degrees, (arc) minutes and (arc) seconds. Decimal notation is another, which combines the degrees value with the arc minutes and arc seconds into a single number. An arc minute is 1/60th of a degree, and an arc second is 1/3600th of a degree, so you can do a little math to sum the values together (preserving the sign of the degrees). GeoPy wants to work with decimal values, so you do need to fold the arc minutes and arc seconds into the degrees value.
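For example, 37° 46′ 30″ works out to 37 + 46/60 + 30/3600 = 37 + 0.7666… + 0.0083… = 37.775 in decimal, and -37° 46′ 30″ works out to -37.775, with the minutes and seconds subtracted instead of added.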

On the other hand, NAD83 and NAD27 are not coordinate notations; they are geodetic datums, and datums are notation agnostic. A datum is simply a standardised way of specifying what coordinate system to use and what point of reference that coordinate system is anchored to.

That said, geopandas can be used to transform between different geodetic datums. The library accepts CRS strings to define what coordinate system to use when interpreting points (the geodetic datum being a component of the CRS); using a coordinate system database such as https://spatialreference.org/ to look up NAD83 and NAD27 gives us the EPSG codes EPSG:4269 and EPSG:4267, respectively. Note that you don't have to create a dataframe here; a GeoSeries is enough if all you want is the conversion.
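As a minimal sketch of that conversion (the sample point is made up, and already in decimal degrees):

import geopandas as gpd
from shapely.geometry import Point

# one hypothetical point, interpreted as NAD83 (EPSG:4269)
nad83 = gpd.GeoSeries([Point(-71.06, 42.36)], crs='EPSG:4269')
nad27 = nad83.to_crs(epsg=4267)  # the same point expressed in NAD27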

So, given that you have degrees, minutes and seconds, you need to convert those values to decimal coordinates to feed to geopandas. And you'd want to do this fast and efficiently. You can do this by using vectorised calculations (where numpy applies calculations to all rows using very fast arithmetic operations directly on the machine representation of the data, and not on the Python representations).

I'm sticking to the same convention here: the input is a Pandas DataFrame df that contains the columns lonD, lonM, lonS, latD, latM and latS. Using geopandas, numpy and shapely:

import geopandas as gpd
import numpy as np
from shapely.geometry import asMultiPoint

def vec_dms_to_dec(d, m, s):
    """convert d, m, s coordinates to decimals

    Can be used as a vectorised operation on whole numpy arrays,
    each array must have the same shape.

    Handles signs only present on the D column, transparently.
    Note that for -0d Mm Ss inputs, the sign might have been lost!
    However, if it was preserved as np.NZERO, this function will
    recover it with np.signbit().

    """
    assert d.shape == m.shape == s.shape
    # account for signs only present on d
    if (m >= 0).all() and (s >= 0).all():
        # all s and m values are without signs
        # so only d carries this info. Use the sign *bit* so negative
        # and positive zero are distinguished correctly.
        sign = np.where(np.signbit(d), -1.0, 1.0)
    else:
        sign = np.ones_like(d)
    return d + sign * m / 60 + sign * s / 3600

# Generate the column names, grouped by component
comps = ([f"{c}{a}" for c in ("lon", "lat")] for a in 'DMS')

# Create a single MultiPoint object from the vectorised conversions of the
# longitude and latitude columns
mpoint = asMultiPoint(
    vec_dms_to_dec(*(df[c].values for c in comps))
)

# Create a GeoSeries object from the MultiPoint object. Using `list()`
# produces `Point()` objects efficiently, faster than GeoSeries would
# otherwise.
# Interpret the points as using NAD83 == EPSG:4269
coords_nad83 = gpd.GeoSeries(list(mpoint), crs='EPSG:4269')

# Convert the series to NAD27 == EPSG:4267
coords_nad27 = coords_nad83.to_crs(epsg=4267)
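Aside: asMultiPoint() is the zero-copy array adapter from Shapely 1.x, and it was removed in Shapely 2.0. On Shapely 2.x you can skip the MultiPoint step entirely and build the points with geopandas.points_from_xy() instead; a sketch under that assumption (comps must be re-created first, since generators are single-use):

# Shapely 2.x alternative: build the Point geometries straight from the
# (n, 2) array of decimal lon/lat values, no MultiPoint intermediary.
comps = ([f"{c}{a}" for c in ("lon", "lat")] for a in 'DMS')
decs = vec_dms_to_dec(*(df[c].values for c in comps))
coords_nad83 = gpd.GeoSeries(
    gpd.points_from_xy(decs[:, 0], decs[:, 1]), crs='EPSG:4269'
)
coords_nad27 = coords_nad83.to_crs(epsg=4267)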

You are then free to convert those back again to values in D, M, S notation:

import numpy as np
import pandas as pd
from shapely.geometry import MultiPoint

def geoseries_to_dms(s, all_signed=True):
    # np.array() on the MultiPoint yields an (n, 2) array of (lon, lat)
    fractions, decimals = np.modf(np.array(MultiPoint(s.to_list())))
    if not all_signed:
        # only the D values are signed. Loses information
        # for input values in the open range (-1.0, 0.0)
        fractions = np.abs(fractions)
    fractions, minutes = np.modf(fractions * 60)
    seconds = fractions * 60
    return pd.DataFrame(
        data=np.stack(
            (decimals, minutes, seconds), axis=2
        ).reshape(-1, 6),
        columns=["lonD", "lonM", "lonS", "latD", "latM", "latS"]
    )

The above uses np.modf() to split the whole degrees from the fractional portion, after which the absolute value of the fraction can be split out into arc minutes and arc seconds again.

If you still wanted to use a GeoDataFrame, create one from the converted GeoSeries, or create one directly from the MultiPoint() object the same way you would create a GeoSeries from it, by using GeoDataFrame(..., geometry=list(points), ...).
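Putting the two directions together (a sketch, reusing the names defined above):

# round trip: NAD83 DMS -> decimal NAD83 -> decimal NAD27 -> NAD27 DMS.
# to_crs() produced fully signed decimal values on every component,
# so all_signed=True (the default) applies.
df_nad27_dms = geoseries_to_dms(coords_nad27)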

Benchmarking, or why you'd want vectorisation

On vectorisation: My code above takes each of the degrees, minutes and seconds columns as three separate numpy arrays and uses those 3 arrays to create a single array of decimal degree values, in one step across all rows. There is no need for a separate call for just the latitude or just the longitude values, as numpy processes d, m and s as arrays, and doesn't care if they have just one dimension or 15.
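To make that concrete, here is a toy run with made-up values, where each of d, m and s is a (2, 2) array (column 0 holds the longitude component, column 1 the latitude component):

import numpy as np

d = np.array([[24.0, 58.0], [71.0, 42.0]])
m = np.array([[45.0, 22.0], [3.0, 21.0]])
s = np.array([[0.0, 30.0], [42.0, 0.0]])
# a single vectorised expression converts every row and column at once:
# [[24.75, 58.375], [71.06166..., 42.35]]
print(d + m / 60 + s / 3600)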

This translates to vastly faster execution. To benchmark this, let's create a new dataframe with an arbitrary number of DMS coordinates; I found it easier to just generate decimal values and convert these to DMS values:

import numpy as np
import pandas as pd
from shapely.geometry import Point, asMultiPoint

def random_world_coords(n):
    # column 0 is longitude in [-180, 180), column 1 latitude in [-90, 90)
    coords = np.random.random((n, 2))
    coords[:, 0] = coords[:, 0] * 360 - 180
    coords[:, 1] = coords[:, 1] * 180 - 90
    # convert to d, m, s; np.modf() keeps the sign on all three components
    fractions, decimals = np.modf(coords)
    fractions, minutes = np.modf(fractions * 60)
    seconds = fractions * 60
    return pd.DataFrame(
        data=np.stack((decimals, minutes, seconds), axis=2).reshape(-1, 6),
        columns=["lonD", "lonM", "lonS", "latD", "latM", "latS"]
    )

and define the approaches to convert these values to decimal points, suitable for GeoSeries() consumption, as functions. I removed the sign handling, as the random data includes signs on all DMS columns, which also makes it trivial to use the same conversion function for both scalar and array operations:

def dms_to_dec(d, m, s):
    """convert d, m, s coordinates to decimals"""
    return d + m / 60 + s / 3600

def martinvalgur_apply(df):
    return df.apply(
        lambda row: Point(
            dms_to_dec(*row[['lonD', 'lonM', 'lonS']]), 
            dms_to_dec(*row[['latD', 'latM', 'latS']])
        ),
        axis=1
    )

def martijnpieters_vectorised(df):
    comps = ([f"{c}{a}" for c in ("lon", "lat")] for a in 'DMS')
    return list(asMultiPoint(
        dms_to_dec(*(df[c].values for c in comps))
    ))

at which point you can test how fast either one is with IPython's %timeit or another benchmarking library:

df100 = random_world_coords(100)
%timeit martinvalgur_apply(df100)
%timeit martijnpieters_vectorised(df100)
# 433 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 96.2 ms ± 7.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That's 100 items, and vectorising is about 4.5 times faster.

If you increase the count to 1000, the difference becomes much more apparent:

df1000 = random_world_coords(1000)
%timeit martinvalgur_apply(df1000)
%timeit martijnpieters_vectorised(df1000)
# 4.31 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 35.7 ms ± 909 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So at 1000 rows, the vectorised approach still takes mere milliseconds (it actually takes less time than at 100 rows because we are now hitting optimisations meant for larger datasets), but the time taken to run df.apply() on those 1000 rows has ballooned to over 4 seconds.

(Note: I also ran the tests with a deep copy of the input for each test, created with DataFrame.copy(), to make sure that I wasn't gaining an advantage from already-processed data, but the timings still went down, not up, for the 100 -> 1000 rows case.)

The non-vectorised version takes time directly proportional to the number of rows, so the numbers for 10k rows are predictable:

df10k = random_world_coords(10_000)
%timeit martinvalgur_apply(df10k)
%timeit martijnpieters_vectorised(df10k)
# 44.1 s ± 1.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 331 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That the df.apply() version took 44 seconds was expected, but I did have to wait a full 5 minutes for the result to come in as IPython still runs the test 7 times.

The vectorised approach clocked in at a mere 331 ms, so we can test just that version at 1 million rows:

df1m = random_world_coords(1_000_000)
%timeit martijnpieters_vectorised(df1m)
# 3.18 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the vectorised approach scales linearly as well, but it started from a far lower value. Most of this time goes into creating the list of Point() objects from the MultiPoint() object, something the geopandas project could improve on by supporting Shapely GeometrySequence objects natively.

Upvotes: 4
