Reputation: 393
I have two tables, 13 GB and 1 GB in size. Table 1 has 12 million rows × 1.6k columns and table 2 has 5k rows × 160 columns. I am performing an inner join on these two tables to find all matching records, as in the example query below, but the query runs for a very long time (hours). The resulting table is 29 GB with 9 million rows. How can I improve the run time of this operation? I have read that clustering the tables improves performance. Is that the best option in this case?
SELECT
*
FROM
`myproject.table1` as t1
INNER JOIN
`myproject.table2` as t2
ON
(
t1.Camp1ID = t2.ID
OR t1.Camp2ID = t2.ID
OR t1.Camp3ID = t2.ID
OR t1.Camp4ID = t2.ID
OR t1.Camp5ID = t2.ID
OR t1.Camp6ID = t2.ID
OR t1.Camp7ID = t2.ID
OR t1.Camp8ID = t2.ID
OR t1.Camp9ID = t2.ID
OR t1.Camp10ID = t2.ID
OR t1.Camp11ID = t2.ID
OR t1.Camp12ID = t2.ID
OR t1.Camp13ID = t2.ID
OR t1.Camp14ID = t2.ID
OR t1.Camp15ID = t2.ID
OR t1.Camp16ID = t2.ID
)
WHERE
t1.Date BETWEEN PARSE_DATE('%m/%d/%y', t2.StartDate) AND PARSE_DATE('%m/%d/%y', t2.EndDate)
Camp IDs are not unique in those columns; multiple rows have the same camp ID.
Upvotes: 1
Views: 317
Reputation: 172984
Some improvements (BigQuery Standard SQL)
#standardSQL
SELECT * EXCEPT(parsed_StartDate, parsed_EndDate)
FROM `myproject.table1` as t1
INNER JOIN (SELECT *,
    PARSE_DATE('%m/%d/%y', StartDate) AS parsed_StartDate,
    PARSE_DATE('%m/%d/%y', EndDate) AS parsed_EndDate
  FROM `myproject.table2`
) as t2
ON t2.ID IN (t1.Camp1ID, t1.Camp2ID, t1.Camp3ID, t1.Camp4ID,
             t1.Camp5ID, t1.Camp6ID, t1.Camp7ID, t1.Camp8ID,
             t1.Camp9ID, t1.Camp10ID, t1.Camp11ID, t1.Camp12ID,
             t1.Camp13ID, t1.Camp14ID, t1.Camp15ID, t1.Camp16ID)
AND t1.Date BETWEEN parsed_StartDate AND parsed_EndDate
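If that is still slow, one further idea (just a sketch, using only the column names from your question and not tested against your schema) is to unnest the 16 CampID columns into one row per (table1 row, camp ID) pair first, so the join becomes a plain equality join:
#standardSQL
-- Sketch: turn the 16 CampID columns into rows, then equi-join on t2.ID
WITH t1_long AS (
  SELECT t1.*, camp_id
  FROM `myproject.table1` AS t1,
  UNNEST([t1.Camp1ID, t1.Camp2ID, t1.Camp3ID, t1.Camp4ID,
          t1.Camp5ID, t1.Camp6ID, t1.Camp7ID, t1.Camp8ID,
          t1.Camp9ID, t1.Camp10ID, t1.Camp11ID, t1.Camp12ID,
          t1.Camp13ID, t1.Camp14ID, t1.Camp15ID, t1.Camp16ID]) AS camp_id
)
SELECT * EXCEPT(camp_id, parsed_StartDate, parsed_EndDate)
FROM t1_long
INNER JOIN (
  SELECT *,
    PARSE_DATE('%m/%d/%y', StartDate) AS parsed_StartDate,
    PARSE_DATE('%m/%d/%y', EndDate) AS parsed_EndDate
  FROM `myproject.table2`
) AS t2
ON t2.ID = t1_long.camp_id
AND t1_long.Date BETWEEN parsed_StartDate AND parsed_EndDate
Note that, unlike the OR/IN version, this can return the same (table1 row, table2 row) pair more than once if several CampID columns in a single row match the same t2.ID, so deduplicate afterwards if that matters for your use case.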
Upvotes: 2
Reputation: 67382
I have read that clustering the tables improves performance. Is that the best option in this case?
If you're implying that you didn't set up proper indexes and are joining on a 13 GB table, I think I found your problem. Add individual non-clustered indexes on every column involved in this join: the camp ID columns and the ID field itself.
After that, remove the hard-coded camp ID columns, set up a properly normalized one-to-many relationship through an intermediary table, and join through that instead.
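A minimal BigQuery-flavored sketch of that intermediary-table idea, assuming a hypothetical unique row key t1_key on table1 (substitute whatever actually identifies a row in your schema) and a made-up mapping table name table1_camps:
#standardSQL
-- Hypothetical mapping table: one row per (table1 row, camp ID) pair
CREATE TABLE `myproject.table1_camps` AS
SELECT t1.t1_key, camp_id
FROM `myproject.table1` AS t1,
UNNEST([t1.Camp1ID, t1.Camp2ID, t1.Camp3ID, t1.Camp4ID,
        t1.Camp5ID, t1.Camp6ID, t1.Camp7ID, t1.Camp8ID,
        t1.Camp9ID, t1.Camp10ID, t1.Camp11ID, t1.Camp12ID,
        t1.Camp13ID, t1.Camp14ID, t1.Camp15ID, t1.Camp16ID]) AS camp_id;

-- The big OR join then becomes two plain equality joins:
SELECT t1.*, t2.*
FROM `myproject.table1` AS t1
JOIN `myproject.table1_camps` AS m ON m.t1_key = t1.t1_key
JOIN `myproject.table2` AS t2 ON t2.ID = m.camp_id
WHERE t1.Date BETWEEN PARSE_DATE('%m/%d/%y', t2.StartDate)
                  AND PARSE_DATE('%m/%d/%y', t2.EndDate)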
Upvotes: 0