CFM

Reputation: 89

Apply function to every row in Pandas Dataframe

I am new to Python and would like to rebuild this example. I have longitude and latitude data for NYC taxi pick-ups and drop-offs; however, I need to convert the data to the Web Mercator format (this is not covered in the example above). I found a function that takes one pair of longitude and latitude values and converts it to Web Mercator format, taken from here; it looks as follows:

import math
def toWGS84(xLon, yLat):
    # Check if coordinate out of range for Latitude/Longitude
    if (abs(xLon) < 180) and (abs(yLat) < 90):
        return

    # Check if coordinate out of range for Web Mercator
    # 20037508.3427892 is full extent of Web Mercator
    if (abs(xLon) > 20037508.3427892) or (abs(yLat) > 20037508.3427892):
        return

    semimajorAxis = 6378137.0  # WGS84 spheroid semimajor axis

    latitude = (1.5707963267948966 - (2.0 * math.atan(math.exp((-1.0 * yLat) / semimajorAxis)))) * (180/math.pi)
    longitude = ((xLon / semimajorAxis) * 57.295779513082323) - ((math.floor((((xLon / semimajorAxis) * 57.295779513082323) + 180.0) / 360.0)) * 360.0)

    return [longitude, latitude]



def toWebMercator(xLon, yLat):
    # Check if coordinate out of range for Latitude/Longitude
    if (abs(xLon) > 180) or (abs(yLat) > 90):
        return

    semimajorAxis = 6378137.0  # WGS84 spheroid semimajor axis
    east = xLon * 0.017453292519943295
    north = yLat * 0.017453292519943295

    northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
    easting = semimajorAxis * east

    return [easting, northing]

def main():
    print(toWebMercator(-105.816001, 40.067633))
    print(toWGS84(-11779383.349100526, 4875775.395628653))

if __name__ == '__main__':
    main()

How do I apply this function to every pair of long/lat coordinates in my pandas DataFrame and save the output in the same DataFrame?

df.tail()
            |    longitude     |    latitude
____________|__________________|______________
11135465    |    -73.986893    |    40.761093  
1113546     |    -73.979645    |    40.747814  
11135467    |    -74.001244    |    40.743172  
11135468    |    -73.997818    |    40.726055  
...

Upvotes: 2

Views: 1881

Answers (3)

Zeugma

Reputation: 32125

If you want to keep a somewhat readable math expression, and an easy conversion of the current function, use eval:

df.eval("""
northing = 3189068.5 * log((1.0 + sin(latitude * 0.017453292519943295)) / (1.0 - sin(latitude * 0.017453292519943295)))
easting = 6378137.0 * longitude * 0.017453292519943295""", inplace=False)
Out[51]: 
         id  longitude   latitude      northing       easting
0  11135465 -73.986893  40.761093  4.977167e+06 -8.236183e+06
1   1113546 -73.979645  40.747814  4.975215e+06 -8.235376e+06
2  11135467 -74.001244  40.743172  4.974533e+06 -8.237781e+06
3  11135468 -73.997818  40.726055  4.972018e+06 -8.237399e+06

You will have to work a bit on the syntax as you cannot use if statements, but you can easily filter out the out-of-boundaries data before calling eval. You can also use inplace=True if you want to directly assign the new columns.
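Putting those two suggestions together, a minimal sketch might look like the following (the sample data here is illustrative, not from the original dataset): filter the out-of-boundary rows with boolean indexing first, then let eval add the columns in place.

```python
import pandas as pd

df = pd.DataFrame({
    "longitude": [-73.986893, -73.979645, 200.0],  # last row is out of range
    "latitude":  [40.761093, 40.747814, 40.0],
})

# eval() cannot express if-statements, so drop out-of-range rows beforehand
df = df[(df["longitude"].abs() <= 180) & (df["latitude"].abs() <= 90)]

# inplace=True assigns the new columns directly onto df
df.eval("""
northing = 3189068.5 * log((1.0 + sin(latitude * 0.017453292519943295)) / (1.0 - sin(latitude * 0.017453292519943295)))
easting = 6378137.0 * longitude * 0.017453292519943295""", inplace=True)
```

After this, `df` contains only the in-range rows plus the two computed columns.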

If you aren't that interested in keeping the math syntax and are after raw speed, the numpy answer will likely still be faster.

Upvotes: 0

DataSwede

Reputation: 5601

With a dataset that size, what would help you the most is understanding how to do things the pandas way. Iterating over rows will yield terrible performance compared to the built-in vectorized methods.

import pandas as pd
import numpy as np

df = pd.read_csv('/yellow_tripdata_2016-06.csv')
df.head(5)

VendorID    tpep_pickup_datetime    tpep_dropoff_datetime   passenger_count trip_distance   pickup_longitude    pickup_latitude RatecodeID  store_and_fwd_flag  dropoff_longitude   dropoff_latitude    payment_type    fare_amount extra   mta_tax tip_amount  tolls_amount    improvement_surcharge   total_amount
0   2   2016-06-09 21:06:36 2016-06-09 21:13:08 2   0.79    -73.983360  40.760937   1   N   -73.977463  40.753979   2   6.0 0.5 0.5 0.00    0.0 0.3 7.30
1   2   2016-06-09 21:06:36 2016-06-09 21:35:11 1   5.22    -73.981720  40.736668   1   N   -73.981636  40.670242   1   22.0    0.5 0.5 4.00    0.0 0.3 27.30
2   2   2016-06-09 21:06:36 2016-06-09 21:13:10 1   1.26    -73.994316  40.751072   1   N   -74.004234  40.742168   1   6.5 0.5 0.5 1.56    0.0 0.3 9.36
3   2   2016-06-09 21:06:36 2016-06-09 21:36:10 1   7.39    -73.982361  40.773891   1   N   -73.929466  40.851540   1   26.0    0.5 0.5 1.00    0.0 0.3 28.30
4   2   2016-06-09 21:06:36 2016-06-09 21:23:23 1   3.10    -73.987106  40.733173   1   N   -73.985909  40.766445   1   13.5    0.5 0.5 2.96    0.0 0.3 17.76

This dataset has 11,135,470 rows, which isn't "big data," but isn't small either. Rather than writing a function and applying it to every row, you'll get far better performance by applying the function's operations to entire columns. I would turn this function:

def toWebMercator(xLon, yLat):
    # Check if coordinate out of range for Latitude/Longitude
    if (abs(xLon) > 180) or (abs(yLat) > 90):
        return

    semimajorAxis = 6378137.0  # WGS84 spheroid semimajor axis
    east = xLon * 0.017453292519943295
    north = yLat * 0.017453292519943295

    northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
    easting = semimajorAxis * east

    return [easting, northing]

into this:

SEMIMAJORAXIS = 6378137.0 # named in all caps since this is a constant
df['pickup_east'] = df['pickup_longitude'] * 0.017453292519943295 # takes all pickup longitude values, multiplies them, then saves the result as a new column named pickup_east
df['pickup_north'] = df['pickup_latitude'] * 0.017453292519943295
# numpy functions allow you to calculate an entire column's worth of values by simply passing in the column. 
df['pickup_northing'] = 3189068.5 * np.log((1.0 + np.sin(df['pickup_north'])) / (1.0 - np.sin(df['pickup_north']))) 
df['pickup_easting'] = SEMIMAJORAXIS * df['pickup_east']

You then have pickup_easting and pickup_northing columns with the calculated values.

For my laptop, this takes:

CPU times: user 1.01 s, sys: 286 ms, total: 1.3 s
Wall time: 763 ms

That's for all 11M rows: 15 minutes down to seconds.

I got rid of the range checks on the values; you could do something like:

df = df[(df['pickup_longitude'].abs() <= 180) & (df['pickup_latitude'].abs() <= 90)]

This uses boolean indexing, which, again, is orders of magnitude faster than looping.
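The whole approach can be wrapped into one reusable helper; a sketch follows (the function name, `prefix` parameter, and sample data are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd

SEMIMAJORAXIS = 6378137.0          # WGS84 spheroid semimajor axis
DEG_TO_RAD = 0.017453292519943295  # pi / 180

def add_web_mercator(df, lon_col, lat_col, prefix):
    """Vectorized Web Mercator conversion: adds <prefix>_easting and
    <prefix>_northing columns, dropping out-of-range coordinates first."""
    # Boolean indexing replaces the per-row if-checks
    df = df[(df[lon_col].abs() <= 180) & (df[lat_col].abs() <= 90)].copy()
    north = df[lat_col] * DEG_TO_RAD
    df[prefix + '_easting'] = SEMIMAJORAXIS * df[lon_col] * DEG_TO_RAD
    df[prefix + '_northing'] = 3189068.5 * np.log(
        (1.0 + np.sin(north)) / (1.0 - np.sin(north)))
    return df

df = pd.DataFrame({
    'pickup_longitude': [-73.986893, -73.979645],
    'pickup_latitude':  [40.761093, 40.747814],
})
df = add_web_mercator(df, 'pickup_longitude', 'pickup_latitude', 'pickup')
```

The same helper can then be called a second time for the dropoff columns.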

Upvotes: 1

piRSquared

Reputation: 294526

try:

df[['longitude', 'latitude']].apply(
    lambda x: pd.Series(toWebMercator(*x), index=['easting', 'northing']),
    axis=1
)
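To save the result in the same DataFrame, as the question asks, the output of apply can be assigned back as new columns; a self-contained sketch (the one-row sample data is illustrative):

```python
import math
import pandas as pd

def toWebMercator(xLon, yLat):
    # Check if coordinate out of range for Latitude/Longitude
    if (abs(xLon) > 180) or (abs(yLat) > 90):
        return None

    semimajorAxis = 6378137.0  # WGS84 spheroid semimajor axis
    east = xLon * 0.017453292519943295
    north = yLat * 0.017453292519943295

    northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
    return [semimajorAxis * east, northing]

df = pd.DataFrame({'longitude': [-73.986893], 'latitude': [40.761093]})

# apply returns a DataFrame with easting/northing columns, which pandas
# aligns by name when assigned back onto df
df[['easting', 'northing']] = df[['longitude', 'latitude']].apply(
    lambda x: pd.Series(toWebMercator(*x), index=['easting', 'northing']),
    axis=1
)
```

Note this calls the Python function once per row, so on 11M rows it will be far slower than the vectorized answer above.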

Upvotes: 1
