Reputation: 4158
In the below google big query, I join two tables "Data" and "Location" on Id, StartTime and StopTime.
Since Data is partitioned by Date, I have the condition based on PartitionTime in the WHERE clauase.
The query runs for a very long time (~20 mins), just wondering whether i am missing some performace techniques for the query efficiency to be improved.
Any help would be appreciated. Thanks !!
SELECT
*
FROM (
SELECT
A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
FROM
`Data` AS A
JOIN
(SELECT * FROM `Location` WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )) AS B
ON
A.StartTime < B.DateTime
AND A.StopTime >= B.DateTime
AND A.Id = B.Id
WHERE
(A._PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01')
AND TIMESTAMP('2016-11-30'))
ORDER BY
B.Id,
A.Id1,
B.DateTime )
ORDER BY
Id,
Id1,
DateTime
Upvotes: 0
Views: 76
Reputation: 173190
I would also remove outer ORDER BY
as I think it is the main killer of performance for your query.
Moving _PARTITIONTIME
up to respective table is another item to consider.
Using SELECT *
in sub-selects doesn't affect performance and cost (as it is final outer SELECT which defines which columns are used in addition to those used in WHERE
and other clauses), but as a good practice I think better to list explicitly needed columns/fields
#standardSQL
SELECT
A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
FROM (
SELECT Id, Id1, StartTime, StopTime
FROM `Data`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-30')
) AS A
JOIN (
SELECT Latitude, Longitude, DateTime
FROM `Location`
WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )
) AS B
ON A.StartTime < B.DateTime
AND A.StopTime >= B.DateTime
AND A.Id = B.Id
You may also consider "compressing" below statement as Elliott suggested,
WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )
but be careful as this can bring unwanted tables to be involved (in case if you have such in your dataset). like for example those with suffixes as '011' or '046' etc.
Yet another option is - you potentially have some logical relation between partitions in Data
and suffix in Location
. If so you can employ it to narrow down JOIN thus making it more performant
Upvotes: 0
Reputation: 33765
A couple of thoughts:
ORDER BY
, since only a top-level ORDER BY
has an effect on the results of the query."25"
, you can use _TABLE_SUFFIX BETWEEN "01" AND "31" AND _TABLE_SUFFIX != "25"
.JOIN
, the filter on _PARTITIONTIME
might not be "pushed down" to avoid reading extra data automatically, e.g. if you are actually using a RIGHT JOIN
. If this is the case, use a subquery such as (SELECT * FROM YourTable WHERE _PARTITIONTIME BETWEEN ...) AS A RIGHT JOIN ...
instead.If you'd like a BigQuery engineer to take a more detailed look at where the time went, you could include a sample job ID in your question and someone may be able to help.
Upvotes: 1