Reputation: 505
For each node's (lat, lng), I want to count how many rents happened within a 100 m distance.
I have two dataframes, one called "nodes_df":
   id                      geocode   title         lng        lat
0   1  POINT(127.036077 37.490958)  place1   19.036077  67.490958
1   2   POINT(127.03103 37.491231)  place2  167.031030  37.491231
2   3    POINT(127.030428 37.4925)  place3  147.630428  27.492500
3   4  POINT(127.029558 37.494329)  place4  117.029558  17.494329
4   5  POINT(127.029326 37.495018)  place5  147.529326  57.495018
and another called "rents_df":
                                  geocode         lng        lat
0  POINT(127.03580515559 37.493864399152)  127.035805  37.493864
1  POINT(127.03580515559 37.493864399152)  127.035805  37.493864
2  POINT(127.03580515559 37.493864399152)  127.035805  37.493864
3  POINT(127.03580515559 37.493864399152)  127.035805  37.493864
4  POINT(127.03580515559 37.493864399152)  127.035805  37.493864
What I want to do is take the (lat, lng) pair from each row of nodes_df, compare it with every (lat, lng) pair in rents_df, and count how many rents are within 100 m.
This is my code:
# Assumes the `haversine` PyPI package, which takes (lat, lng) tuples and returns km by default.
from haversine import haversine

def count_per_node(node_geocode, title):
    # Count the rents within a 100 m (0.1 km) radius of the node
    # by comparing the node with every rent.
    distances = rents_df[['lat', 'lng']].apply(lambda x: haversine(x, node_geocode), axis=1)
    within_df = rents_df.loc[distances <= 0.1]
    return len(within_df)

# For each node, look up its (lat, lng) and count the nearby rents.
data = {}
for node in nodes_df["title"]:
    lat_lng_df = nodes_df.loc[nodes_df["title"] == node][["lat", "lng"]]
    node_geocode = (lat_lng_df.values[0][0], lat_lng_df.values[0][1])
    data[node] = count_per_node(node_geocode, node)
print(data)
This does the job, but my data is large and the script crashes after an hour or so. Any help?
**Desired output:**
    title  number_of_rents_within_range
0  place1                           355
1  place2                          1000
2  place3                          3043
3  place4                          3094
4  place5                        230823
and so on...
I am currently running the following code:
rents_geocode = list(zip(rents_df.lat, rents_df.lng))
nodes_geocode = list(zip(nodes_df.lat, nodes_df.lng))
counts = []
for n in nodes_geocode:
    count = 0
    for r in rents_geocode:
        if haversine(n, r) <= 0.1:
            count += 1
    counts.append(count)
but this still compares every node with every rent, so it has O(n^2) time complexity...
Upvotes: 0
Views: 46
Reputation: 2495
You can use a vectorized NumPy version of the haversine function (link):
km = 0.1
nodes_df['count'] = nodes_df.apply(
    lambda row: sum(haversine_np(row.lng, row.lat,
                                 rents_df.lng, rents_df.lat) < km),
    axis=1)
nodes_df
   id   title         lng        lat  count
0   1  place1   19.036077  67.490958      0
1   2  place2  167.031030  37.491231      0
2   3  place3  147.630428  27.492500      0
3   4  place4  117.029558  17.494329      0
4   5  place5  147.529326  57.495018      0
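The `haversine_np` helper is not shown above, so here is a minimal sketch of what such a function might look like, assuming the argument order (lng1, lat1, lng2, lat2) used in the call and a mean Earth radius of 6371 km. The `count_within` function and the sample frames are hypothetical names added for illustration; `count_within` goes one step further and replaces the row-wise `apply` with a single broadcast over all node/rent pairs.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_np(lng1, lat1, lng2, lat2):
    """Great-circle distance in km; inputs are decimal degrees (scalars or arrays)."""
    lng1, lat1, lng2, lat2 = map(np.radians, (lng1, lat1, lng2, lat2))
    dlng = lng2 - lng1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2.0) ** 2
    # Clip guards against tiny floating-point overshoot outside [0, 1].
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def count_within(nodes_df, rents_df, km=0.1):
    """Hypothetical helper: for every node, count the rents closer than `km`."""
    # A (n_nodes, 1) column against a (n_rents,) row broadcasts
    # to a full (n_nodes, n_rents) distance matrix.
    dist = haversine_np(
        nodes_df["lng"].to_numpy()[:, None],
        nodes_df["lat"].to_numpy()[:, None],
        rents_df["lng"].to_numpy(),
        rents_df["lat"].to_numpy(),
    )
    return (dist < km).sum(axis=1)


# Tiny made-up sample, not the asker's data.
sample_nodes = pd.DataFrame({"title": ["a", "b"],
                             "lng": [127.0358, 127.0],
                             "lat": [37.4939, 37.0]})
sample_rents = pd.DataFrame({"lng": [127.035805, 127.035805, 127.05],
                             "lat": [37.493864, 37.493864, 37.493864]})
sample_nodes["count"] = count_within(sample_nodes, sample_rents)
```

The broadcast still evaluates every pair, so the work stays proportional to nodes × rents, but it runs in compiled NumPy code. The full distance matrix must fit in memory, so for very large inputs it may be necessary to process the nodes in chunks.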
Upvotes: 2