ABY

Reputation: 393

Bigquery Innerjoin is running for very long time

I have two tables, 13 GB and 1 GB in size. Table 1 has 12 million rows × 1.6k columns and table 2 has 5k rows × 160 columns. I am performing an inner join on these two tables to find all the matching records, as in the example query below, but the query runs for hours. The resulting table is 29 GB with 9 million rows. How can I improve the run time of this operation? I read that clustering the tables improves performance. Is that the best option for this case?

SELECT  
* 
FROM 
`myproject.table1` as t1
INNER JOIN
`myproject.table2` as t2
ON
(
t1.Camp1ID = t2.ID
OR t1.Camp2ID = t2.ID
OR t1.Camp3ID = t2.ID
OR t1.Camp4ID = t2.ID
OR t1.Camp5ID = t2.ID
OR t1.Camp6ID = t2.ID
OR t1.Camp7ID = t2.ID
OR t1.Camp8ID = t2.ID
OR t1.Camp9ID = t2.ID
OR t1.Camp10ID = t2.ID
OR t1.Camp11ID = t2.ID
OR t1.Camp12ID = t2.ID
OR t1.Camp13ID = t2.ID
OR t1.Camp14ID = t2.ID
OR t1.Camp15ID = t2.ID
OR t1.Camp16ID = t2.ID
)
WHERE
t1.Date BETWEEN PARSE_DATE('%m/%d/%y', t2.StartDate) AND PARSE_DATE('%m/%d/%y', t2.EndDate)

Camp IDs are not unique in those columns; multiple rows share the same camp IDs.

Upvotes: 1

Views: 317

Answers (2)

Mikhail Berlyant

Reputation: 172984

Some improvements (BigQuery Standard SQL)

#standardSQL
SELECT * EXCEPT(parsed_StartDate, parsed_EndDate)
FROM `myproject.table1` as t1
INNER JOIN (SELECT * , 
  PARSE_DATE('%m/%d/%y', StartDate) parsed_StartDate, 
  PARSE_DATE('%m/%d/%y', EndDate) parsed_EndDate 
  FROM `myproject.table2`
) as t2
ON t2.ID IN (t1.Camp1ID,t1.Camp2ID,t1.Camp3ID,t1.Camp4ID,t1.Camp5ID,t1.Camp6ID,t1.Camp7ID,t1.Camp8ID,t1.Camp9ID,t1.Camp10ID,t1.Camp11ID,t1.Camp12ID,t1.Camp13ID,t1.Camp14ID,t1.Camp15ID,t1.Camp16ID)
AND t1.Date BETWEEN parsed_StartDate AND parsed_EndDate

Upvotes: 2

Blindy

Reputation: 67382

I did read clustering the tables improves performance. Is that the best option for this case?

If you're implying that you didn't set up proper indexes and are doing joins like this on a 13 GB table, I think I found your problem. Add individual non-clustered indexes on every one of the columns involved here: the camp IDs and the ID field itself.

And after that, remove all the hard-coded camp ID columns, set up a properly normalized one-to-many relationship with an intermediary table, and join through that instead.
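As a sketch of that normalized layout (the `record_camps` mapping table and the `RecordID`/`CampID` column names are hypothetical, chosen for illustration): the 16 `CampNID` columns would move into a mapping table holding one row per (record, camp) pair, and the query becomes a pair of equi-joins instead of a 16-way OR:

```sql
-- Hypothetical intermediary table replacing the 16 CampNID columns:
-- CREATE TABLE `myproject.record_camps` (RecordID INT64, CampID INT64);

SELECT t1.*, t2.*
FROM `myproject.table1` AS t1
INNER JOIN `myproject.record_camps` AS rc   -- one row per (record, camp) pair
  ON rc.RecordID = t1.RecordID              -- assumes table1 has a RecordID key
INNER JOIN `myproject.table2` AS t2
  ON t2.ID = rc.CampID
WHERE t1.Date BETWEEN PARSE_DATE('%m/%d/%y', t2.StartDate)
                  AND PARSE_DATE('%m/%d/%y', t2.EndDate)
```

Equi-join conditions like these let the engine use a hash join, whereas a chain of ORs generally forces a much more expensive evaluation per candidate row pair.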

Upvotes: 0
