Reputation: 327
I'm using Google Patents Public Dataset to extract patent information about pesticides using the CPC code "A01N" PRESERVATION OF BODIES OF HUMANS OR ANIMALS OR PLANTS OR PARTS THEREOF
But while I run the following Query, I don't obtain all the results as If I run a non-unnest query. See both below
SELECT patents.country_code, count(DISTINCT publication_number) as quantity
FROM
`patents-public-data.patents.publications` AS patents,
UNNEST(title_localized) AS title,
UNNEST(abstract_localized) AS abstract,
UNNEST(cpc) AS cpc_code,
UNNEST(inventor_harmonized) AS inventor,
UNNEST(assignee_harmonized) AS assignee
WHERE
cpc_code.code LIKE '%A01N%'
GROUP BY patents.country_code
ORDER BY quantity DESC
Row | country_code | quantity |
---|---|---|
1 | US | 67280 |
2 | CN | 59067 |
3 | WO | 39560 |
4 | EP | 37886 |
5 | CA | 23115 |
SELECT patents.country_code, count(DISTINCT publication_number) as quantity
FROM
`patents-public-data.patents.publications` AS patents,
UNNEST(cpc) AS cpc_code
WHERE
cpc_code.code LIKE '%A01N%'
GROUP BY country_code
ORDER BY quantity DESC
Row | country_code | quantity |
---|---|---|
1 | US | 77056 |
2 | CN | 70654 |
3 | EP | 60291 |
4 | WO | 39624 |
5 | JP | 36135 |
I don't understand why there is a difference between both results as no more filters were applied?
How can I maintain all the rows when I UNNEST more columns?
Upvotes: 1
Views: 165
Reputation: 3464
Some of the rows in the columns that you are UNNEST
ing are empty arrays. When you do the implicit CROSS JOIN
, you're joining on a NULL
which gives you no results, so some of those rows disappear and your count is lower. If you start commenting out some of those CROSS JOIN
s you'll see your count start to go up. Since you're not actually using those columns, you should remove them from your query. If you want them later, get the counts first, and then LEFT JOIN
on something like 1=1
.
Upvotes: 2