user6972
user6972

Reputation: 911

What is a faster way to match rows in a data frame and remove unmatched rows?

I have a dataframe with time, latitude, longitude, elevation, speed and I'm using shapely to reduce the data set based on a tolerance to smooth the latitude/longitude pairs. It works fine, but when I try to match up the smoothed simplified version of data points (lat, lon) with the original data frame with Time, Elevation, elements, it takes too long when the data points are > 500.

Essentially what I am doing is looping through the original dataset and finding matching pairs, recording the index until they are all matched. I'm using a "last_find" variable to speed the search a little because the points are almost always sequential and there's no reason to re-search from the beginning. FWIW, I've never seen it need to fall back to (last_find=0) a full data frame scan on my test datasets which makes sense based on the sequential line nature of the data and smoothing method.

        lon = pd.Series(pd.Series(simplified_line.coords.xy)[1])
        lat = pd.Series(pd.Series(simplified_line.coords.xy)[0])

        si = pd.DataFrame({'Longitude': lon, 'Latitude': lat})
        si.tail()

        si['df_index'] = None
        pd.options.mode.chained_assignment = None  # default='warn', suppress warning during copying dataframe
        last_find = 0  # assume data is sequential and and start search at last point found to reduce iterations
        for si_i, si_row in si.iterrows():
            si_coords = (si_row['Latitude'], si_row['Longitude'])
            found = False
            for df_i, df_row in islice(track.iterrows(), last_find, None):
                if si_coords == (df_row['Latitude'], df_row['Longitude']):
                    si['df_index'][si_i] = df_i
                    last_find = df_i
                    found = True
                    break
            if not found:
                last_find = 0
                # Rescanning full dataset for match
                for df_i, df_row in islice(track.iterrows(), last_find, None):
                    if si_coords == (df_row['Latitude'], df_row['Longitude']):
                        si['df_index'][si_i] = df_i
                        last_find = df_i
                        break

        rs = track.loc[si['df_index'].dropna()]

This process of rebuilding the dataframe into "rs" is very slow. (22 seconds for just 500 points). Is there a better way to do this type of matching to reduce the original data frame size?

Here is a full example for examination:

import pandas as pd
from pandas import DataFrame
from shapely.geometry import LineString
from time import time
from itertools import islice
import datetime


class RDP:

    def __init__(self, tracks, tolerance=0.000002):

        self.df = tracks
        self.tolerance = tolerance
        return

    def smooth(self):
        """
        Smooths list of data frames
        :return: list of smoothed data frames
        """

        results = []
        start_time = time()
        for track in self.df:

            coordinates = track.as_matrix(columns=['Latitude', 'Longitude'])
            line = LineString(coordinates)
            # If preserve topology is set to False, the method will use the Ramer-Douglas-Peucker algorithm
            simplified_line = line.simplify(self.tolerance, preserve_topology=False)

            lon = pd.Series(pd.Series(simplified_line.coords.xy)[1])
            lat = pd.Series(pd.Series(simplified_line.coords.xy)[0])

            si = pd.DataFrame({'Longitude': lon, 'Latitude': lat})
            si.tail()

            si['df_index'] = None
            pd.options.mode.chained_assignment = None  # default='warn', suppress warning during copying dataframe
            last_find = 0  # assume data is sequential and and start search at last point found to reduce iterations
            for si_i, si_row in si.iterrows():
                si_coords = (si_row['Latitude'], si_row['Longitude'])
                found = False
                for df_i, df_row in islice(track.iterrows(), last_find, None):
                    if si_coords == (df_row['Latitude'], df_row['Longitude']):
                        si['df_index'][si_i] = df_i
                        last_find = df_i
                        found = True
                        break
                if not found:
                    last_find = 0
                    # Rescanning full dataset for match
                    for df_i, df_row in islice(track.iterrows(), last_find, None):
                        if si_coords == (df_row['Latitude'], df_row['Longitude']):
                            si['df_index'][si_i] = df_i
                            last_find = df_i
                            break

            rs = track.loc[si['df_index'].dropna()]
            results.append(rs)
            print('process took %s seconds' % round(time() - start_time, 2))
        return results


if __name__ == "__main__":
    data = [[-155.05156, 19.73201, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 9), None, 0],
            [-155.05156, 19.73201, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 10), 0.0, 0.0],
            [-155.05156, 19.73201, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 11), 1.8244950963755258, 0.0],
            [-155.05157, 19.73202, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 12), 1.4678475295952227,
             1.527543187532957],
            [-155.05157, 19.73203, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 13), 1.11120000035271,
             1.1122983328025196],
            [-155.05157, 19.73203, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 14), 2.3687194876712123, 0.0],
            [-155.05159, 19.73204, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 15), 1.7399596859879076,
             2.3710607190787623],
            [-155.05159, 19.73205, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 16), 1.7399596281612155,
             1.112298332448747],
            [-155.05161, 19.73206, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 17), 1.7399595703344959,
             2.3710604875433656],
            [-155.05161, 19.73207, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 18), 1.7399595111950648,
             1.112298332448747],
            [-155.05163, 19.73208, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 19), 2.096606870194645,
             2.3710602536550747],
            [-155.05164, 19.73209, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 20), 1.6752646424182498,
             1.527542875149723],
            [-155.05165, 19.7321, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 21), 2.289051665826317,
             1.5275428299682154],
            [-155.05167, 19.73212, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 22), 2.754150510219822,
             3.055085523596321],
            [-155.05168, 19.73214, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 23), 2.4562322562443732,
             2.458660072750598],
            [-155.05169, 19.73216, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 24), 2.4124750922364004,
             2.458660017743196],
            [-155.05171, 19.73217, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 25), 2.9312779133573135,
             2.3710592140947706],
            [-155.05172, 19.7322, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 26), 2.9312777104723176,
             3.497291307909982],
            [-155.05174, 19.73221, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 27), 2.7103926533029608,
             2.3710587533735854],
            [-155.05176, 19.73223, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 28), 2.7541498631496495,
             3.0550845355246805],
            [-155.05177, 19.73225, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 29), 3.214644630731547,
             2.4586597654103666],
            [-155.0518, 19.73227, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 30), 3.272104136727811,
             3.8489512292988133],
            [-155.05182, 19.73228, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 31), 2.4699338807004922,
             2.3710579406395524],
            [-155.05184, 19.73229, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 32), 2.710391831502253,
             2.3710578250365275],
            [-155.05186, 19.73231, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 33), 2.571984173054396,
             3.055083816479077],
            [-155.05188, 19.73231, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 34), 2.2303087951040954,
             2.0939690200246193],
            [-155.0519, 19.73232, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 35), 2.412473942713288,
             2.371057475376574],
            [-155.05191, 19.73234, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 36), 2.892513413572397,
             2.458659515345401],
            [-155.05194, 19.73235, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 37), 3.1904307936770424,
             3.3320852845152014],
            [-155.05196, 19.73237, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 38), 3.190430517344615,
             3.0550832771287606],
            [-155.05199, 19.73238, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 39), 3.1904303798788662,
             3.332084723430405],
            [-155.05201, 19.7324, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 40), 2.710390701811524,
             3.055083009665372],
            [-155.05203, 19.73241, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 41), 2.3687150928988454,
             2.3710564358044426],
            [-155.05205, 19.73242, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 42), 2.4699323240615967,
             2.371056320034746],
            [-155.05207, 19.73243, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 43), 2.4699322129590935,
             2.371056201746366],
            [-155.05209, 19.73244, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 44), 3.1708845562819272,
             2.3710560884951897],
            [-155.05212, 19.73246, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 45), 3.1708842854814536,
             3.8489481826678027],
            [-155.05214, 19.73247, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 46), 2.710389981744121,
             2.371055741019484],
            [-155.05216, 19.73249, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 47), 2.710389776420281,
             3.0550821973706315],
            [-155.05218, 19.7325, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 48), 2.710389674730764,
             2.3710553940411665],
            [-155.0522, 19.73252, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 49), 3.0520651178491267,
             3.055081929390566],
            [-155.05222, 19.73254, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 50), 2.7103892638045273,
             3.0550817499490517],
            [-155.05224, 19.73255, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 51), 2.710389160077225,
             2.371054817376382],
            [-155.05226, 19.73257, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 52), 2.7103889546600244,
             3.0550814785741593],
            [-155.05228, 19.73258, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 53), 3.1069293021008604,
             2.3710544675462026],
            [-155.05231, 19.7326, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 54), 2.6855886784059737,
             3.8489459341931775],
            [-155.05232, 19.73261, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 55), 2.2890482314811806,
             1.5275405390797137],
            [-155.05234, 19.73263, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 56), 3.0520641294586754,
             3.055080939475094],
            [-155.05236, 19.73265, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 57), 3.052063950954796,
             3.0550807613433832],
            [-155.05238, 19.73267, 23.0, datetime.datetime(2017, 12, 28, 17, 50, 58), 3.131887830893072,
             3.0550805819008096],
            [-155.0524, 19.73269, 22.0, datetime.datetime(2017, 12, 28, 17, 50, 59), 2.79021178709455,
             3.0550804024580738],
            [-155.05242, 19.7327, 22.0, datetime.datetime(2017, 12, 28, 17, 51), 2.710387619443569, 2.371053080813694],
            [-155.05244, 19.73272, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 1), 2.7103874140249378,
             3.0550801308237463],
            [-155.05246, 19.73273, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 2), 2.790211385044462,
             2.3710527309810545],
            [-155.05248, 19.73275, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 3), 2.7902111840874317,
             3.0550798630984697],
            [-155.0525, 19.73276, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 4), 2.7103870033308333,
             2.3710523836665467],
            [-155.05252, 19.73278, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 5), 3.4486025267361144,
             3.0550795919786187],
            [-155.05255, 19.7328, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 6), 2.685586626408571,
             3.848942726844205],
            [-155.05256, 19.73281, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 7), 2.3688709766335045,
             1.5275396387506783],
            [-155.05258, 19.73283, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 8), 2.790210380257448,
             3.055079144293523],
            [-155.0526, 19.73284, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 9), 2.7902102788103615,
             2.37105145984478],
            [-155.05262, 19.73286, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 10), 3.211710268550555,
             3.0550788729150917],
            [-155.05264, 19.73288, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 11), 3.1318859927435865,
             3.0550786952967113],
            [-155.05266, 19.7329, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 12), 2.71038556626744,
             3.0550785153370557],
            [-155.05268, 19.73291, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 13), 2.3687093086562108,
             2.3710506470866193],
            [-155.0527, 19.73292, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 14), 2.469926768769237,
             2.3710505291274866],
            [-155.05272, 19.73293, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 15), 3.272095501921749,
             2.3710504155416987],
            [-155.05275, 19.73295, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 16), 3.170877650790823,
             3.8489403198357444],
            [-155.05277, 19.73296, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 17), 2.3687087304575547,
             2.371050068389868],
            [-155.05279, 19.73297, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 18), 2.3687086147991403,
             2.3710499502645015],
            [-155.05281, 19.73298, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 19), 2.790208873068198,
             2.3710498368443993],
            [-155.05283, 19.733, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 20), 2.7902086721088404,
             3.0550776168243536],
            [-155.05285, 19.73301, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 21), 2.3687081533850103,
             2.3710494917124847],
            [-155.05287, 19.73302, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 22), 2.368708036413647,
             2.3710493735868456],
            [-155.05289, 19.73303, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 23), 2.71038422971785,
             2.3710492556271126],
            [-155.05291, 19.73305, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 24), 2.7541450397296856,
             3.055077167566475],
            [-155.05292, 19.73307, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 25), 2.412468561327511,
             2.4586574769885563],
            [-155.05294, 19.73308, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 26), 2.368707343773043,
             2.371048681300471],
            [-155.05296, 19.73309, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 27), 2.7103836137458464,
             2.3710485633404104],
            [-155.05298, 19.73311, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 28), 2.7103834083234952,
             3.0550766266297775],
            [-155.053, 19.73312, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 29), 2.710383306631016,
             2.3710482158540307],
            [-155.05302, 19.73314, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 30), 3.190419933693602,
             3.0550763588997274],
            [-155.05305, 19.73315, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 31), 2.8487433306642,
             3.332070455791797],
            [-155.05307, 19.73316, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 32), 2.710382894565474,
             2.371047750738638],
            [-155.05309, 19.73318, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 33), 2.811600384514776,
             3.055075999491225],
            [-155.05311, 19.73319, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 34), 2.5711414689310477,
             2.3710474054378037],
            [-155.05313, 19.7332, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 35), 2.811600187179107,
             2.371047289829799],
            [-155.05315, 19.73322, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 36), 3.190418832907143,
             3.055075638256173],
            [-155.05318, 19.73323, 23.0, datetime.datetime(2017, 12, 28, 17, 51, 37), 2.9903853616785745,
             3.3320689728999744],
            [-155.05319, 19.73325, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 38), 2.5103487005418743,
             2.4586569745708617],
            [-155.05321, 19.73326, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 39), 2.710381868668547,
             2.3710465975375103],
            [-155.05323, 19.73328, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 40), 2.7103816619319017,
             3.0550751009671164],
            [-155.05325, 19.73329, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 41), 2.3687049136083917,
             2.3710462453430305],
            [-155.05327, 19.7333, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 42), 2.3687047992608443,
             2.371046131921171],
            [-155.05329, 19.73331, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 43), 2.710381355108257,
             2.3710460163125644],
            [-155.05331, 19.73333, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 44), 2.7103811483711353,
             3.0550746517035354],
            [-155.05333, 19.73334, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 45), 2.368704335399346,
             2.371045666469919],
            [-155.05335, 19.73335, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 46), 3.1069186773136934,
             2.3710455533797066],
            [-155.05338, 19.73337, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 47), 2.685580778436882,
             3.8489335801226137],
            [-155.05339, 19.73338, 22.0, datetime.datetime(2017, 12, 28, 17, 51, 48), 2.427402283410334,
             1.5275370795403593]]
    columns = ['Longitude', 'Latitude', 'Altitude', 'Time', 'Speed',
               'Distance']
    df = list()
    df.append(DataFrame(data, columns=columns))
    rdp = RDP(df)
    print(rdp.smooth())

Upvotes: 1

Views: 356

Answers (1)

Gabriel A
Gabriel A

Reputation: 1827

The hardest part was understanding what you were asking for. This is equivalent to all of your code starting at the first for loop.

rs = si.merge( track, on = ["Latitude", "Longitude"] )

You're basically just merging two dataframes based on 2 columns. This merge defaults to an inner merge, which will only keep the rows were a match is found in both.

Upvotes: 1

Related Questions