sm7
sm7

Reputation: 61

Bigquery runs indefinitely

I have a query like this:

WITH A AS (
  SELECT id FROM db1.X AS d
  WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
),
B AS (
  SELECT id
  FROM db2.Y as t
  WHERE
    t.start <= TIMESTAMP(DATE_SUB(current_date(), INTERVAL 7 DAY))
    AND t.end >= TIMESTAMP(current_date())
)

SELECT * FROM A as d JOIN B as t on d.id = t.id;

db1.X has 1.6 Billion rows.
db2.Y has 15K rows.

db1.X is a materialized view on a bigger table. db2.Y is a table with source as a google sheet.

Issue

The query keeps running indefinitely. I had to cancel it when it reached about an hour, but one query which I left running went on for 6 hours and then timed-out without any further error.

The query used to run fine till 2nd Jan, After that I reran it on 9th Jan and it never ended. Both the tables are auto-populated so it is possible that they ran over some threshold during this time, but I could not find any such threshold value. (Similar fate of 3 other queries, same tables)

What's tried

  1. Removed join to use a WHERE IN. Still never ending.

  2. No operation works on A, but all work on B. For ex: SELECT count(*) from B; will work. It keeps on going for A. (But it works when the definition of B is removed)

  3. The above behaivour is replicated even when not using subqueries.

  4. A has 10.6 Million rows, B has 31 rows (Much less than actual table, but still the same result)

  5. The actual query was without any subqueries and used only multiple date comparisons while joining. So I used subqueries which filters data before going into the join. (This is the one above) But it also runs indefinitely

  6. JOIN EACH: This never got out of syntax error. Replacing JOIN with JOIN EACH in above query complains about the "AS", removing that it complains that I should use dataset.tablename, on fixing that it complains Expected end of input but got "."

Upvotes: 1

Views: 105

Answers (2)

sm7
sm7

Reputation: 61

It turns out that the table size is the problem. I created a smaller table and ran exactly the same queries, and that works. This was also expected because the query just stopped running one day. The only variable was the amount of data in source tables.

In my case, I needed the data every week, so I created a scheduled query to update the smaller table with only 1 month worth of data.

The smaller versions of the tables have:
db1.X: 40 million rows
db2.Y: 400 rows

Upvotes: 1

dumky
dumky

Reputation: 16

Not sure what's going on exactly in terms of issues due to size, but apart from some code clarity your query should run as expected. Am I correct in reading from your query that table A should return results within the last 7 days whereas table B should return results outside of the last 7 days? Some things you might try to make debugging easier.

  • Use BETWEEN and dates. E.g. WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
  • Use a backtick (`) for your FROM statement to prevent table name errors like the one you mentioned (expected end of input but got ".")
  • Limit your CTE instead of the outer query. The current limit in the outer query has no effect on computed data only on the output. E.g. to limit the source data from table A instead use:
WITH A AS (
SELECT id FROM `db1.X`
  WHERE DATE(date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
  LIMIT 10
)
...

Upvotes: 0

Related Questions