user3447653
user3447653

Reputation: 4158

Performance enhancement in google big query sql

In the below google big query, I join two tables "Data" and "Location" on Id, StartTime and StopTime.

Since Data is partitioned by Date, I have the condition based on PartitionTime in the WHERE clauase.

The query runs for a very long time (~20 mins), just wondering whether i am missing some performace techniques for the query efficiency to be improved.

Any help would be appreciated. Thanks !!

  SELECT
    *
  FROM (
      SELECT
          A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
      FROM
          `Data` AS A
      JOIN
        (SELECT * FROM `Location` WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
        "19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )) AS B
      ON
        A.StartTime < B.DateTime
        AND A.StopTime >= B.DateTime
        AND A.Id = B.Id
  WHERE
    (A._PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01')
      AND TIMESTAMP('2016-11-30'))
  ORDER BY
    B.Id,
    A.Id1,
    B.DateTime )
ORDER BY
  Id,
  Id1,
  DateTime

Upvotes: 0

Views: 76

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173190

I would also remove outer ORDER BY as I think it is the main killer of performance for your query.
Moving _PARTITIONTIME up to respective table is another item to consider.
Using SELECT * in sub-selects doesn't affect performance and cost (as it is final outer SELECT which defines which columns are used in addition to those used in WHERE and other clauses), but as a good practice I think better to list explicitly needed columns/fields

#standardSQL
SELECT
  A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
FROM (
  SELECT Id, Id1, StartTime, StopTime 
  FROM `Data` 
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-30')
) AS A
JOIN (
  SELECT Latitude, Longitude, DateTime 
  FROM `Location` 
  WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )
) AS B
ON  A.StartTime < B.DateTime
AND A.StopTime >= B.DateTime
AND A.Id = B.Id   

You may also consider "compressing" below statement as Elliott suggested,

WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )  

but be careful as this can bring unwanted tables to be involved (in case if you have such in your dataset). like for example those with suffixes as '011' or '046' etc.

Yet another option is - you potentially have some logical relation between partitions in Data and suffix in Location. If so you can employ it to narrow down JOIN thus making it more performant

Upvotes: 0

Elliott Brossard
Elliott Brossard

Reputation: 33765

A couple of thoughts:

  • There is no need for the inner ORDER BY, since only a top-level ORDER BY has an effect on the results of the query.
  • If you want to query over all suffixes except "25", you can use _TABLE_SUFFIX BETWEEN "01" AND "31" AND _TABLE_SUFFIX != "25".
  • Depending on the type of JOIN, the filter on _PARTITIONTIME might not be "pushed down" to avoid reading extra data automatically, e.g. if you are actually using a RIGHT JOIN. If this is the case, use a subquery such as (SELECT * FROM YourTable WHERE _PARTITIONTIME BETWEEN ...) AS A RIGHT JOIN ... instead.

If you'd like a BigQuery engineer to take a more detailed look at where the time went, you could include a sample job ID in your question and someone may be able to help.

Upvotes: 1

Related Questions