Reputation: 89
I am new to Python and would like to rebuild this example. I have longitude and latitude data about NYC taxi pick-ups and drop-offs; however, I need to convert the data to the Web Mercator format (this step is not covered in the example above). I found a function that takes one pair of longitude and latitude values and converts it to Web Mercator format, which was taken from here. It looks as follows:
import math

def toWGS84(xLon, yLat):
    # Input that already looks like longitude/latitude degrees is not Web Mercator
    if (abs(xLon) < 180) and (abs(yLat) < 90):
        return
    # Check if coordinate out of range for Web Mercator
    # 20037508.3427892 is full extent of Web Mercator
    if (abs(xLon) > 20037508.3427892) or (abs(yLat) > 20037508.3427892):
        return
    semimajorAxis = 6378137.0  # WGS84 spheroid semimajor axis
    latitude = (1.5707963267948966 - (2.0 * math.atan(math.exp((-1.0 * yLat) / semimajorAxis)))) * (180/math.pi)
    longitude = ((xLon / semimajorAxis) * 57.295779513082323) - ((math.floor((((xLon / semimajorAxis) * 57.295779513082323) + 180.0) / 360.0)) * 360.0)
    return [longitude, latitude]

def toWebMercator(xLon, yLat):
    # Check if coordinate out of range for Latitude/Longitude
    if (abs(xLon) > 180) or (abs(yLat) > 90):
        return
    semimajorAxis = 6378137.0  # WGS84 spheroid semimajor axis
    east = xLon * 0.017453292519943295   # degrees to radians
    north = yLat * 0.017453292519943295  # degrees to radians
    northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
    easting = semimajorAxis * east
    return [easting, northing]

def main():
    print(toWebMercator(-105.816001, 40.067633))
    print(toWGS84(-11779383.349100526, 4875775.395628653))

if __name__ == '__main__':
    main()
How do I apply this function to every pair of long/lat coordinates in my pandas DataFrame and save the output in the same DataFrame?
df.tail()
| longitude | latitude
____________|__________________|______________
11135465 | -73.986893 | 40.761093
1113546 | -73.979645 | 40.747814
11135467 | -74.001244 | 40.743172
11135468 | -73.997818 | 40.726055
...
Upvotes: 2
Views: 1881
Reputation: 32125
If you want to keep the math readable and convert the current function with minimal changes, use eval:
df.eval("""
northing = 3189068.5 * log((1.0 + sin(latitude * 0.017453292519943295)) / (1.0 - sin(latitude * 0.017453292519943295)))
easting = 6378137.0 * longitude * 0.017453292519943295""", inplace=False)
Out[51]:
id longitude latitude northing easting
0 11135465 -73.986893 40.761093 4.977167e+06 -8.236183e+06
1 1113546 -73.979645 40.747814 4.975215e+06 -8.235376e+06
2 11135467 -74.001244 40.743172 4.974533e+06 -8.237781e+06
3 11135468 -73.997818 40.726055 4.972018e+06 -8.237399e+06
You will have to work a bit on the syntax as you cannot use if statements, but you can easily filter out the out-of-boundaries data before calling eval. You can also use inplace=True if you want to directly assign the new columns.
If you aren't that interested in keeping the math syntax and are looking for full speed, the numpy answer will likely perform faster still.
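As a minimal sketch of the "filter first, then eval" idea described above: the sample rows come from the question, while the third row (longitude 200) is a made-up out-of-range value added only to show the filter working. Using a strict `< 90` bound on latitude is my own assumption, chosen because `1 - sin(latitude)` is zero at the poles.

```python
import pandas as pd

# Two rows from the question plus one hypothetical out-of-range row.
df = pd.DataFrame({
    "longitude": [-73.986893, -73.979645, 200.0],
    "latitude": [40.761093, 40.747814, 40.0],
})

# Boolean mask replaces the `if` check that eval cannot express.
valid = df[(df["longitude"].abs() <= 180) & (df["latitude"].abs() < 90)]

# With the default inplace=False, eval returns a new DataFrame
# that carries the two extra columns.
result = valid.eval("""
northing = 3189068.5 * log((1.0 + sin(latitude * 0.017453292519943295)) / (1.0 - sin(latitude * 0.017453292519943295)))
easting = 6378137.0 * longitude * 0.017453292519943295
""")
print(result)
```

The out-of-range row is dropped before the math runs, so the expression itself stays free of conditionals.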
Upvotes: 0
Reputation: 5601
With a dataset that size, what would help you the most is understanding how to do things the pandas way. Iterating over rows will yield terrible performance compared to the built-in vectorized methods.
import pandas as pd
import numpy as np
df = pd.read_csv('/yellow_tripdata_2016-06.csv')
df.head(5)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2016-06-09 21:06:36 2016-06-09 21:13:08 2 0.79 -73.983360 40.760937 1 N -73.977463 40.753979 2 6.0 0.5 0.5 0.00 0.0 0.3 7.30
1 2 2016-06-09 21:06:36 2016-06-09 21:35:11 1 5.22 -73.981720 40.736668 1 N -73.981636 40.670242 1 22.0 0.5 0.5 4.00 0.0 0.3 27.30
2 2 2016-06-09 21:06:36 2016-06-09 21:13:10 1 1.26 -73.994316 40.751072 1 N -74.004234 40.742168 1 6.5 0.5 0.5 1.56 0.0 0.3 9.36
3 2 2016-06-09 21:06:36 2016-06-09 21:36:10 1 7.39 -73.982361 40.773891 1 N -73.929466 40.851540 1 26.0 0.5 0.5 1.00 0.0 0.3 28.30
4 2 2016-06-09 21:06:36 2016-06-09 21:23:23 1 3.10 -73.987106 40.733173 1 N -73.985909 40.766445 1 13.5 0.5 0.5 2.96 0.0 0.3 17.76
This dataset has 11,135,470 rows, which isn't "big data," but isn't small either. Rather than writing a function and applying it to every row, you'll get much better performance by applying each step of the function to whole columns at once. I would turn this function:
def toWebMercator(xLon, yLat):
    # Check if coordinate out of range for Latitude/Longitude
    if (abs(xLon) > 180) or (abs(yLat) > 90):
        return
    semimajorAxis = 6378137.0  # WGS84 spheroid semimajor axis
    east = xLon * 0.017453292519943295
    north = yLat * 0.017453292519943295
    northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
    easting = semimajorAxis * east
    return [easting, northing]
into this:
SEMIMAJORAXIS = 6378137.0  # typed in all caps since this is a constant

# Takes all pickup longitude values, multiplies them by pi/180 (degrees to radians),
# then saves the result as a new column named pickup_east.
df['pickup_east'] = df['pickup_longitude'] * 0.017453292519943295
df['pickup_north'] = df['pickup_latitude'] * 0.017453292519943295

# numpy functions calculate an entire column's worth of values when you pass in the column.
df['pickup_northing'] = 3189068.5 * np.log((1.0 + np.sin(df['pickup_north'])) / (1.0 - np.sin(df['pickup_north'])))
df['pickup_easting'] = SEMIMAJORAXIS * df['pickup_east']
You then have pickup_easting and pickup_northing columns with the calculated values.
On my laptop, this takes:
CPU times: user 1.01 s, sys: 286 ms, total: 1.3 s
Wall time: 763 ms
That is for all 11 million rows. 15 minutes --> seconds.
I got rid of the checks on the values; to drop out-of-range rows before converting, you could do something like:
df = df[(df['pickup_longitude'].abs() <= 180) & (df['pickup_latitude'].abs() <= 90)]
This uses boolean indexing, which, again, is orders of magnitude faster than looping.
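If you would rather keep every row and mark bad coordinates instead of dropping them, the column-wise steps can be wrapped in a small helper. This is only a sketch under my own assumptions: `add_web_mercator` is a hypothetical name, the sample frame reuses two rows from the CSV preview plus invented edge cases, and invalid inputs are turned into NaN rather than filtered out.

```python
import numpy as np
import pandas as pd

SEMIMAJOR_AXIS = 6378137.0  # WGS84 semimajor axis in metres

def add_web_mercator(df, lon_col, lat_col, prefix):
    """Hypothetical helper: add <prefix>_easting and <prefix>_northing columns."""
    lon = df[lon_col]
    lat = df[lat_col]
    # Rows outside the valid degree range are masked to NaN,
    # so the results for those rows are NaN as well.
    valid = (lon.abs() <= 180) & (lat.abs() < 90)
    lon_rad = np.radians(lon.where(valid))
    lat_rad = np.radians(lat.where(valid))
    df[prefix + "_easting"] = SEMIMAJOR_AXIS * lon_rad
    # SEMIMAJOR_AXIS / 2 equals the 3189068.5 constant used in the original function.
    df[prefix + "_northing"] = (SEMIMAJOR_AXIS / 2) * np.log(
        (1.0 + np.sin(lat_rad)) / (1.0 - np.sin(lat_rad))
    )
    return df

# Two rows from the CSV preview, plus made-up rows: (0, 0) and an invalid longitude.
trips = pd.DataFrame({
    "pickup_longitude": [-73.983360, -73.981720, 0.0],
    "pickup_latitude": [40.760937, 40.736668, 0.0],
    "dropoff_longitude": [-73.977463, -73.981636, 500.0],
    "dropoff_latitude": [40.753979, 40.670242, 40.0],
})

# The same helper handles both coordinate pairs.
for prefix in ("pickup", "dropoff"):
    add_web_mercator(trips, prefix + "_longitude", prefix + "_latitude", prefix)

print(trips)
```

Masking with `Series.where` keeps the row count unchanged, which can be convenient when the other trip columns are still needed for rows with a bad pickup or drop-off point.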
Upvotes: 1
Reputation: 294526
Try:
df[['longitude', 'latitude']].apply(
    lambda x: pd.Series(toWebMercator(*x), ['xLon', 'yLat']),
    axis=1
)
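To store the row-wise result back in the same DataFrame, as the question asks, one possible sketch is shown here. It assumes a variant of `toWebMercator` that returns a pair of NaN values instead of `None` for out-of-range input (my change, so that every row expands to two columns), and the column names `easting` and `northing` are my own choice. Note that a row-wise `apply` is much slower than the vectorized approaches on millions of rows.

```python
import math
import pandas as pd

def toWebMercator(xLon, yLat):
    # Assumption: return NaNs (not None) for out-of-range input,
    # so each row always yields exactly two values.
    if (abs(xLon) > 180) or (abs(yLat) > 90):
        return [float("nan"), float("nan")]
    semimajorAxis = 6378137.0  # WGS84 spheroid semimajor axis
    east = xLon * 0.017453292519943295
    north = yLat * 0.017453292519943295
    northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
    easting = semimajorAxis * east
    return [easting, northing]

# Two sample rows from the question.
df = pd.DataFrame({
    "longitude": [-73.986893, -73.979645],
    "latitude": [40.761093, 40.747814],
})

# result_type="expand" turns each returned list into separate columns.
mercator = df.apply(
    lambda row: toWebMercator(row["longitude"], row["latitude"]),
    axis=1,
    result_type="expand",
)
mercator.columns = ["easting", "northing"]

# Join on the shared index to attach the new columns to the original frame.
df = df.join(mercator)
print(df)
```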
Upvotes: 1