Jake Slattery

Reputation: 87

"Resources exceeded" error in BigQuery

SELECT A, B, C, D, E, F,
  EXTRACT(MONTH FROM PARSE_DATE('%b', Month)) AS MonthNumber,
  PARSE_DATETIME(' %Y%b%d ', CONCAT(CAST(Year AS STRING), Month, '1')) AS G
FROM `XXX.YYY.ZZZ`
WHERE A != 'null' AND B = 'MYSTRING'
ORDER BY A, Year

The query processes about 20 GB per run.

My table ZZZ has 396,567,431 (about 396 million) rows with a size of 53 GB. If I execute the above query without a LIMIT clause, I get an error saying "Resources exceeded".

If I execute it with a LIMIT clause, it gives the same error for larger limits.

I am writing a Python script that uses the API to run the query above, compute some metrics, and write the output to another table. It writes about 1.7 million output rows, essentially aggregating the first table by column A (the original table has multiple rows per value of column A).

Now, I know I can turn on "Allow large results" and select a destination table to get around this error, but that doesn't serve the purpose of my script.

Also, I read that the ORDER BY is the expensive part causing this, but my algorithm is described below and I don't see a way around the ORDER BY.

Also, my script pages through the query results 100,000 rows at a time:

# query_job, Compute and EntityType are defined earlier in the script.
log = []          # rows accumulated for the current value of column A
allresults = []   # one Compute() result per value of column A
page_token = None

while True:
    rows, total_rows, page_token = query_job.results.fetch_data(
        max_results=100000, page_token=page_token)
    for row in rows:
        lastAValue = log[-1][0] if log else None

        if lastAValue is None or row[0] == lastAValue:
            # Still in the same group: keep accumulating rows.
            log.append(row)
        else:
            # Column A changed: process the accumulated group, then start a new one.
            res = Compute(log, EntityType, lastAValue)
            allresults.append(res)
            log = [row]
    if not page_token:
        break

# The loop above never flushes the final group, so process it here.
if log:
    res = Compute(log, EntityType, log[-1][0])
    allresults.append(res)

I have a few questions. Here is a sample of the data:

Column A | Column B | ...
123 | NDG
123 | KOE
123 | TR
345 | POP
345 | KOP
345 | POL

The way I have kept my logic is: I iterate through the rows and check whether column A is the same as the previous row's column A. If it is the same, I add the row to an array. The moment I encounter a different column A value, i.e. 345, I send the first group of column A rows for processing, compute the metrics, and add the result to my output array. Based on this approach I have some questions:

1) I am effectively querying only once, so I should be charged for only one query. Does BigQuery charge per total rows / number of pages, i.e. will the individual pages fetched by the code above count as separate queries and be charged separately?

2) Assume the page size in the above example were 5: the 345 entries would then be spread across pages. In that case, would I lose information about the 6th entry (345 - POL) because it lands on a different page? Is there a workaround for this?

3) Is there a direct way to avoid checking whether successive rows differ in value, i.e. a direct "group by and get each group as an array" mechanism? The above approach takes an estimated couple of hours to run if I add a limit of 1 million.

4) How can I get around the "Resources exceeded" error when specifying limits higher than 1 million?

Upvotes: 2

Views: 1736

Answers (1)

Michael Entin

Reputation: 7744

You are asking BigQuery to produce one huge sorted result, which BigQuery currently cannot efficiently parallelize, so you get the "Resources exceeded" error.

The efficient way to perform this kind of query is to let the computation happen in SQL inside BigQuery, rather than extracting a huge result and doing the post-processing in Python. Analytic functions are a common way to do what you described, if the Compute() function can be expressed in SQL.

E.g., to find the value of B in the last row before A changes, you can use the LAST_VALUE function, something like:

select LAST_VALUE(B) OVER(PARTITION BY A ORDER BY Year ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) from ...
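
Similarly, if Compute() just needs all of a group's rows together, a plain GROUP BY with ARRAY_AGG collects each group into an array in one query. Note the ordering here is only within each group, so BigQuery can parallelize across values of A. A rough sketch only; the struct fields are guesses based on the columns in your query, adjust as needed:

SELECT
  A,
  -- collect every row of the group, ordered by Year, into one array per value of A
  ARRAY_AGG(STRUCT(B, C, D, E, F, Year, Month) ORDER BY Year) AS group_rows
FROM `XXX.YYY.ZZZ`
WHERE A != 'null' AND B = 'MYSTRING'
GROUP BY A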

If you could describe what Compute() does, we could try to fill in the details.

Upvotes: 0
