barambuk
barambuk

Reputation: 133

Bigquery - reduce query execution time

I have bigquery table with the following structure:

source | feature |  value |
---------------------------
s1     |    f1   |  HIGH  |
---------------------------
s2     |    f1   |  LOW   |
---------------------------
s2     |    f2   | MEDIUM |
.
.
.
---------------------------
s7000 |   f1000 |  HIGH   |

I would like to count how much sources appears for all permutations of features and values. Example result set:

feature  |  feature  | value  | value  |  count
-----------------------------------------------
    f1   |    f2     |  HIGH  | HIGH   |   7
-----------------------------------------------
    f1   |    f2     |  LOW   | HIGH   |  211
.
.
.
-----------------------------------------------
  f7000  |  f1000    |  HIGH  |MEDIUM |  123  

This is my sql query:

select     m1.feature as f1, m2.feature as f2, m1.value as v1, m2.value as v2, count(*)
from       matrix  m1 inner join matrix m2 on m1.source=m2.source
group by   m1.feature, m2.feature, m1.value, m2.value

The problem is very poor performance of bigquery in this case. Matrix table is not so big, about 7,000,000 records. I stopped the query execution after 7 minutes...

Could you help me to improve the performance of my query?

Upvotes: 0

Views: 319

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173190

Few recommendations:

  • you most likely don't want symmetric pairs - so should add something like
    WHERE m1.feature < m2.feature
  • in case if you have duplicates in initial matrix - you would want to pre-aggregate it first (before self-joining)
  • good chunk of time is spent to bring result of the query to UI, so you can use CREATE TABLE AS or INSERT INTO to suppress loading result to UI and rather get it into table that then you can use/query at your will

Upvotes: 3

Related Questions