Racana
Racana

Reputation: 327

Google Bigquery query Record data - Patents

I'm using Google Patents Public Dataset to extract patent information about pesticides using the CPC code "A01N" PRESERVATION OF BODIES OF HUMANS OR ANIMALS OR PLANTS OR PARTS THEREOF

But while I run the following Query, I don't obtain all the results as If I run a non-unnest query. See both below

SELECT patents.country_code, count(DISTINCT publication_number) as quantity
FROM
  `patents-public-data.patents.publications` AS patents,
  UNNEST(title_localized) AS title,
  UNNEST(abstract_localized) AS abstract,
  UNNEST(cpc) AS cpc_code,
  UNNEST(inventor_harmonized) AS inventor,
  UNNEST(assignee_harmonized) AS assignee
WHERE 
  cpc_code.code LIKE '%A01N%'
GROUP BY patents.country_code
ORDER BY quantity DESC
Row country_code quantity
1 US 67280
2 CN 59067
3 WO 39560
4 EP 37886
5 CA 23115
SELECT patents.country_code, count(DISTINCT publication_number) as quantity
FROM
  `patents-public-data.patents.publications` AS patents,
  UNNEST(cpc) AS cpc_code
WHERE 
  cpc_code.code LIKE '%A01N%'
GROUP BY country_code
ORDER BY quantity DESC
Row country_code quantity
1 US 77056
2 CN 70654
3 EP 60291
4 WO 39624
5 JP 36135

I don't understand why there is a difference between both results as no more filters were applied?
How can I maintain all the rows when I UNNEST more columns?

Upvotes: 1

Views: 165

Answers (1)

Daniel
Daniel

Reputation: 3464

Some of the rows in the columns that you are UNNESTing are empty arrays. When you do the implicit CROSS JOIN, you're joining on a NULL which gives you no results, so some of those rows disappear and your count is lower. If you start commenting out some of those CROSS JOINs you'll see your count start to go up. Since you're not actually using those columns, you should remove them from your query. If you want them later, get the counts first, and then LEFT JOIN on something like 1=1.

Upvotes: 2

Related Questions