Nexus
Nexus

Reputation: 65

Optimize query in BigQuery

I'm trying to run a query in BigQuery but I receive “Resources exceeded during query execution”.

Actually, I have two tables :

Table user:
Id  | User   | Latitude  | Longitude
1   | 1      |  50.83    |  4.01
2   | 1      |  50.84    |  4.03
3   | 2      |  48.78    |  2.87
4   | 3      |  47.42    |  8.53
…
Table point_of_interest:
Id  | Latitude  | Longitude | Range   |  Tag  
1   |  50.81    |  3.98     |  0.05   |  a;b;c;d
2   |  50.85    |  4.03     |  0.025  |  a;c;e;f
3   |  40.80    |  3.87     |  0.04   |  a;d;g
4   |  47.42    |  8.57     |  0.08   |  b
…

The purpose is to join the tables to tag all user with the Latitude, Longitude and Range.

For that, I used that query :

SELECT
  u.User AS id,
  GROUP_CONCAT(poi.Tag) AS tag
FROM (
  SELECT
    u.User,
    poi.Tag,
  FROM
    [user] u
  CROSS JOIN
    [point_of_interest] poi
  WHERE
    u.Latitude BETWEEN poi.Latitude – poi.Range AND poi.Latitude + poi.Range
    AND
    u.Longitude BETWEEN poi.Longitude – poi.Range AND poi.Longitude + poi.Range )
GROUP BY
  id

The user table is currently 520 MB and the point_of_interest table is only 565 KB, but will probably grow in the time.

I want to know if there is a better way to fulfill this goal, and the best architecture for that.

EDIT:

I also tried using a range LEFT JOIN EACH however BigQuery only support equality statements after the ON keyword.

Upvotes: 3

Views: 363

Answers (2)

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

For a similar question, I optimized it via generating keys for each row that can be used to avoid having to CROSS JOIN over the whole dataset.

http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html

StackO: How to improve performance of GeoIP query in BigQuery?

Btw, is this question related to the one posted later (Tag huge list of elements with lat/long with large list of geolocation data)?

Upvotes: 0

Pentium10
Pentium10

Reputation: 207912

You need to shard your tables and run multiple smaller queries.

Something like this:

SELECT * FROM table WHERE ABS(HASH(id) % 100) == 0
SELECT * FROM table WHERE ABS(HASH(id) % 100) == 1
SELECT * FROM table WHERE ABS(HASH(id) % 100) == 2
SELECT * FROM table WHERE ABS(HASH(id) % 100) == 3
...
SELECT * FROM table WHERE ABS(HASH(id) % 100) == 99

but you need to find a decent high number and not 100 as in my example, and write a piece of code that automates this for you. First try one shard manually with a decent margin of records in the shard.

You can WRITE_APPEND the results into the same destination table and store separately from raw data.

Upvotes: 1

Related Questions