Felipe Carlo
Felipe Carlo

Reputation: 349

Poor performance to calculate distances with latitude and longitude

I'm trying to get the quantity of people in 2km or less from specific locations, to do that I'm calculating distances with latitude and longitude. In one table I just have latitude, longitude and in other I have more fields but with latitude and longitude as well.

The query is valid and will process 12.3 MB when run.

The query I'm using is:

select 
e.lat,
e.long,
e.searches,
count(distinct l.id)
from dataset.table1 e
join dataset.table2 l 
     on 6371000*ACOS(COS(3.14159265359/180*(90-e.lat)) *COS(3.14159265359/180*(90-l.lat)) +SIN(3.14159265359/180*(90-e.lat)) *SIN(3.14159265359/180*(90-l.lat)) *COS(3.14159265359/180*(e.long-l.long))) <= 2000 # way to calculate distance from lats and longs
group by e.lat,
e.long,
e.searches

But the query is not running, it takes more than 15 minutes every time and I have to cancel.

What could be the problem?

Upvotes: 1

Views: 650

Answers (2)

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

This query, similar to your original one, takes 2 minutes:

SELECT distance, COUNT(*) FROM (
SELECT
  e.lat,
  e.long
  , 6371000*ACOS(COS(3.14159265359/180*(90-e.lat)) *COS(3.14159265359/180*(90-l.lat)) +SIN(3.14159265359/180*(90-e.lat)) *SIN(3.14159265359/180*(90-l.lat)) *COS(3.14159265359/180*(e.long-l.long))) <= 2000 distance
  , e.long-l.long longlong, e.lat-l.lat latlat
FROM
  `buoyant-history-159518.test_lat_long.table1` e
JOIN
  `buoyant-history-159518.test_lat_long.table1` l
ON 
 (COS(3.14159265359/180*(90-e.lat)) *COS(3.14159265359/180*(90-l.lat)) +SIN(3.14159265359/180*(90-e.lat)) *SIN(3.14159265359/180*(90-l.lat)) *COS(3.14159265359/180*(e.long-l.long))) <= COS(2000/6371000) + 4.5E-8
)
GROUP BY distance

To prevent floating point errors, I had to transform the JOIN inequality:

6371000*ACOS(COS(3.14159265359/180*(90-e.lat)) *COS(3.14159265359/180*(90-l.lat)) +SIN(3.14159265359/180*(90-e.lat)) *SIN(3.14159265359/180*(90-l.lat)) *COS(3.14159265359/180*(e.long-l.long))) 
<= 2000

to the similar:

(COS(3.14159265359/180*(90-e.lat)) *COS(3.14159265359/180*(90-l.lat)) +SIN(3.14159265359/180*(90-e.lat)) *SIN(3.14159265359/180*(90-l.lat)) *COS(3.14159265359/180*(e.long-l.long))) <= COS(2000/6371000) + 4.5E-8)
<= COS(2000/6371000) + 4.5E-8

Now the question is how can we get a much better performance than 2 minutes? Let's add to the JOIN some 'sanity' filters -> No 2 points in the same area can have a distance between lats and long more than 0.something:

SELECT distance, COUNT(*) FROM (
SELECT
  e.lat,
  e.long
  , (COS(3.14159265359/180*(90-e.lat)) *COS(3.14159265359/180*(90-l.lat)) +SIN(3.14159265359/180*(90-e.lat)) *SIN(3.14159265359/180*(90-l.lat)) *COS(3.14159265359/180*(e.long-l.long))) <= COS(2000/6371000) distance
  , e.long-l.long longlong, e.lat-l.lat latlat
FROM
  `buoyant-history-159518.test_lat_long.table1` e
JOIN
  `buoyant-history-159518.test_lat_long.table1` l
ON 
 NOT (e.long=l.long AND e.lat=l.lat) 
 AND ABS(e.long-l.long) < 0.021 #sanity JOIN check
 AND ABS(e.lat-l.lat) < 0.018 #sanity JOIN check
)
GROUP BY distance

With this we get very similar results, but in 12 seconds instead of 2 minutes.

I can't optimize your exact query, as your sample tables don't have the same number or rows nor columns - but trying applying those "sanity JOIN checks" before going for the full CROSS JOIN.

Upvotes: 5

Mosha Pasumansky
Mosha Pasumansky

Reputation: 13994

Doing this kind of brute force analysis using JOIN on predicate for distance is not going to have good performance. BigQuery team is looking into adding better support for geospatial analysis (i.e. using ST_DWithin function in JOIN predicate). In the meantime, droping into Postgres and using PostGIS extension is probably your best bet.

Upvotes: 2

Related Questions