Guus
Guus

Reputation: 329

Google BigQuery nested select subquery in cross join

I have the following code:

SELECT ta.application as koekkoek, ta.ipc, ipc_count/ipc_tot as ipc_share, t3.sfields FROM (
  select t1.appln_id as application, t1.ipc_subclass_symbol as ipc, count(t2.appln_id) as ipc_count, sum(ipc_count) over (PARTITION BY application) as ipc_tot
    FROM temp.tls209_small t1
    CROSS JOIN 
   (SELECT appln_id, FROM temp.tls209_small group by appln_id )  t2 
      where t1.appln_id = t2.appln_id
        GROUP BY application, ipc
      ) as ta
  CROSS JOIN thesis.ifris_ipc_concordance t3
  WHERE ta.ipc LIKE t3.ipc+'%'
    AND ta.ipc NOT LIKE t3.not_ipc+'%'
    AND t3.not_appln_id NOT IN 
       (SELECT ipc_subclass_symbol from temp.tls209_small t5 where t5.appln_id = ta.application)

Giving the folllowing error:
Field 'ta.application' not found.

I have tried numerous notations for the field, but BigQuery doesn't seem to recognize any reference to other tables in the subquery.

The purpose of the code is as to assign new technology classifications to records based on a concordance table:

I have got two tables: One large table with application id's, classifications and some other stuff tls209_small:
tls209_small

And a concordance table with some exception rules ifris_ipc_concordance: ifris_ipc_concordance

In the end I need to assign the sfields label for each row in tls209 (300 million rows). The rules are that ipc_class_symbol+'%' from the first table should be like ipcin the second table, but not like not_ipc. In addition, the not_appln_id value, if present, should not be associated with the same appln_id in the first table.

So a small example, say this is the input of the query:

appln_id | ipc_class_symbol
1        | A1
1        | A2
1        | A3
1        | C3

sfields | ipc | not_ipc | not_appln_id
X       | A   | A2      | null
Y       | A   | null    | A3

appln_id 1 should get two times sfields X because ipc=A, not_ipc matches A1 and A3. Y should not be assigned at all as A3 occurs in appln_id 1.

In the results, I also need the share of the ipc_class_symbol for a single application (1 for 328100001, 0.5 for 32100009 etc.)

Without the last condition (AND t3.not_appln_id NOT IN (SELECT ipc_subclass_symbol from temp.tls209_small t5 where t5.appln_id = ta.application) ) the query works fine: results

Any suggestions on how to get the subquery to recognize the application id (ta.application), or other ways to introduce the last condition to the query?

I realize my explanation of the problem may not be very straightforward, so if anything is not clear please indicate so, I'll try to clarify the issues.

Upvotes: 1

Views: 1290

Answers (2)

Guus
Guus

Reputation: 329

A working solution for the problem was achieved by first generating a table my matching only the ipc_class_symbol from the first table, to the ipc column of the second, but also including the not_ipc, and not_appln_id columns from the second. In addition, a list of all ipc class labels assigned to each appln_id was added using the GROUP_CONCAT method.

Finally, with help from Pentium10, the resulting table was filtered based on the exeption rules as also discussed in this question.

In the final query, the GROUP BY and JOIN arguments needed EACH modifiers to allow the large tables to be processed:

SELECT application as appln_id, ipc as ipc_class, ipc_share, sfields as ifris_class FROM (
  SELECT * FROM (
    SELECT ta.application as application, ta.ipc as ipc, ipc_count/ipc_tot as ipc_share, t3.sfields as sfields, t3.ipc as yes_ipc, t3.not_ipc as not_ipc, t3.not_appln_id as exclude, t4.classes as other_classes FROM (
      SELECT t1.appln_id as application, t1.ipc_class_symbol as ipc, count(t2.appln_id) as ipc_count, sum(ipc_count) over (PARTITION BY application) as ipc_tot
        FROM thesis.tls209_appln_ipc t1

        FULL OUTER JOIN EACH
          (SELECT appln_id, FROM thesis.tls209_appln_ipc GROUP EACH BY appln_id )  t2 
            ON t1.appln_id = t2.appln_id
             GROUP EACH BY application, ipc
          ) AS ta


        LEFT JOIN EACH (
          SELECT appln_id, GROUP_CONCAT(ipc_class_symbol) as classes FROM [thesis.tls209_appln_ipc] 
            GROUP EACH BY appln_id) t4
        ON ta.application = t4.appln_id

        CROSS JOIN  thesis.ifris_ipc_concordance t3
        WHERE ta.ipc CONTAINS t3.ipc
  ) as tx
   WHERE (not ipc contains not_ipc or not_ipc is null) 
   AND (not other_classes contains exclude or exclude is null or other_classes is null)
)

Upvotes: 1

Pentium10
Pentium10

Reputation: 207828

The query you're performing is doing an anti-join. You can re-write this as an explicit join, but it is a little verbose:

SELECT * 
FROM [x.z] as z
LEFT OUTER JOIN EACH [x.y] as y ON y.appln_id = z.application
WHERE y.not_appln_id is NULL

Upvotes: 1

Related Questions