For each lat,lng in a dataframe, loop through another dataframe and compare

Question

For each node's (lat, lng), I want to see how many rents happen within a distance of 100 m.

I have two dataframes, one called "nodes_df":

    id               geocode         title      lng        lat
0   1   POINT(127.036077 37.490958) place1  19.036077   67.490958
1   2   POINT(127.03103 37.491231)  place2  167.031030  37.491231
2   3   POINT(127.030428 37.4925)   place3  147.630428  27.492500
3   4   POINT(127.029558 37.494329) place4  117.029558  17.494329
4   5   POINT(127.029326 37.495018) place5  147.529326  57.495018

and another called "rents_df":

                 geocode                        lng     lat
0   POINT(127.03580515559 37.493864399152)  127.035805  37.493864
1   POINT(127.03580515559 37.493864399152)  127.035805  37.493864
2   POINT(127.03580515559 37.493864399152)  127.035805  37.493864
3   POINT(127.03580515559 37.493864399152)  127.035805  37.493864
4   POINT(127.03580515559 37.493864399152)  127.035805  37.493864

For each (lat, lng) pair in a row of nodes_df, I want to compare it with every (lat, lng) pair in rents_df and find out how many of them are within 100 m.

This is my code:

# Assumed import: the PyPI "haversine" package, which takes (lat, lng) tuples and returns km.
from haversine import haversine

def count_per_node(node_geocode, title):
    # Keep only the rents within 100 m (0.1 km) of the node,
    # comparing the node with every rent.
    within_df = rents_df.loc[rents_df[['lat', 'lng']].apply(lambda x: haversine(x, node_geocode), axis=1) <= 0.1]

    return len(within_df)

# For each node, look up its (lat, lng) and count the nearby rents.
data = {}
for node in nodes_df["title"]:
    lat_lng_df = nodes_df.loc[nodes_df["title"] == node][["lat", "lng"]]
    node_geocode = (lat_lng_df.values[0][0], lat_lng_df.values[0][1])

    data[node] = count_per_node(node_geocode, node)

    print(data)

This does the job, but I have a large dataset and the script crashes after roughly an hour. Any help?

Desired output:

        title    number_of_rents_within_range
  0    place1             355
  1    place2             1000
  2    place3             3043
  3    place4             3094
  4    place5            230823

and so on...

I am currently running the following code:

rents_geocode = list(zip(rents_df.lat, rents_df.lng))
nodes_geocode = list(zip(nodes_df.lat, nodes_df.lng))
counts = []

for n in nodes_geocode:
    count = 0

    for r in rents_geocode:
        if haversine(n , r) <= 0.1:
            count += 1

    counts.append(count)

but it still compares every node with every rent, so the running time is O(n * m) for n nodes and m rents (effectively O(n^2)).

ResidentSleeper · Accepted Answer

You can use a vectorized NumPy version of the haversine function (the haversine_np called below; the original answer links to its definition):
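The haversine_np helper is not defined in this answer itself. A minimal sketch of what such a function could look like, assuming inputs in decimal degrees, a result in kilometres, and a mean Earth radius of 6371 km (the constant and the exact body are assumptions here, not copied from the linked source):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from the answer


def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in decimal degrees.

    Every argument may be a scalar or an array-like (NumPy array, pandas
    Series); the usual NumPy broadcasting rules apply.
    """
    # Convert all four inputs from degrees to radians.
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # Haversine formula.
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

Because every step is an element-wise NumPy operation, the node coordinates can be scalars while the rent coordinates are whole columns, and the result holds one distance per rent.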

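km = 0.1  # search radius in kilometres (100 m)
# For each node, compute the distance to every rent in one vectorized call
# and count how many of them fall within the radius.
nodes_df['count'] = nodes_df.apply(
    lambda row: (haversine_np(row.lng, row.lat, rents_df.lng, rents_df.lat) < km).sum(),
    axis=1)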

nodes_df

   id   title         lng        lat  count
0   1  place1   19.036077  67.490958      0
1   2  place2  167.031030  37.491231      0
2   3  place3  147.630428  27.492500      0
3   4  place4  117.029558  17.494329      0
4   5  place5  147.529326  57.495018      0
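The apply above still computes one distance per (node, rent) pair. If that is too slow for very large data, one possible refinement (not part of the original answer) is to sort the rents by latitude once and only run the haversine check on rents inside a narrow latitude band around each node. This works because two points within 100 m of each other cannot differ in latitude by more than 0.1 km divided by the Earth radius. The sketch below is an illustration under assumptions: count_within is a hypothetical helper name, the 6371 km radius is assumed, and the column names lat and lng are taken from the question.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; inputs in decimal degrees, arrays allowed."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def count_within(nodes, rents, km=0.1):
    """Hypothetical helper: count rents closer than `km` to each node."""
    # Sort the rents by latitude once so each band can be found by binary search.
    rents_sorted = rents.sort_values("lat")
    r_lat = rents_sorted["lat"].to_numpy()
    r_lng = rents_sorted["lng"].to_numpy()

    # Largest latitude difference (in degrees) two points within `km` can have,
    # widened slightly to stay safe against floating-point rounding.
    band = np.degrees(km / EARTH_RADIUS_KM) * 1.001

    counts = []
    for lat, lng in zip(nodes["lat"].to_numpy(), nodes["lng"].to_numpy()):
        lo = np.searchsorted(r_lat, lat - band, side="left")
        hi = np.searchsorted(r_lat, lat + band, side="right")
        # Exact distance check, but only for the rents inside the band.
        d = haversine_np(lng, lat, r_lng[lo:hi], r_lat[lo:hi])
        counts.append(int((d < km).sum()))
    return counts
```

With the dataframes from the question this could be used as nodes_df['count'] = count_within(nodes_df, rents_df). The worst case is unchanged when most rents share nearly the same latitude, so treat it as a prefilter rather than a guaranteed speed-up.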
