Alona

Reputation: 67

Google Patents - scraping patent publication numbers using Python and BigQuery

I need to collect a large number of publication numbers from Google Patents, for example US7863316B2 and KR102121633B1. I first tried scraping the site with classic Python tools such as BeautifulSoup, but that approach doesn't work with Google Patents. I then switched to Google Cloud BigQuery and got some results, but before I had learned how to use the platform properly I hit this error: `Quota exceeded: Your project exceeded quota for free query bytes scanned`. This is the code I was using to get the data:


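```python
# Assumed setup (not shown in the original snippet): the BigQuery client
# library with default project credentials.
from google.cloud import bigquery

client = bigquery.Client()
search_term = 'epilepsy'  # note: the query below hard-codes this term


def create_query(search_term):
  # Sample roughly 1000 random publications whose top terms include
  # "epilepsy" and whose grant_date is below 20000101.
  q = r'''
  WITH 
  pubs as (
    SELECT DISTINCT 
      pub.publication_number
    FROM `patents-public-data.patents.publications` pub
      INNER JOIN `patents-public-data.google_patents_research.publications` gpr ON
        pub.publication_number = gpr.publication_number
    WHERE 
      "epilepsy" IN UNNEST(gpr.top_terms)
      AND pub.grant_date < 20000101
  )

  SELECT
    publication_number, url
  FROM 
    `patents-public-data.google_patents_research.publications`
  WHERE
    publication_number in (SELECT publication_number from pubs)
    AND RAND() <= 1000/(SELECT COUNT(*) FROM pubs)
  '''

  return q

df = client.query(create_query(search_term)).to_dataframe()

if len(df) == 0:
  raise ValueError('No results for your search term. Retry with another term.')
else:
  print('Search complete for search term: "{}". {} random assets selected.'
        .format(search_term, len(df)))

# embedding_v1 is not in the SELECT list above, so only build the
# mapping when that column is actually present in the result.
if 'embedding_v1' in df.columns:
  embedding_dict = dict(zip(df.publication_number.tolist(),
                            df.embedding_v1.tolist()))

df.head()
```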

Are there other ways to get the information I need?
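
Since the error is about bytes scanned, it may help to check how much data a query would read before running it. The sketch below uses a BigQuery dry run, which reports the bytes a query would process without executing it. It assumes the `google-cloud-bigquery` client library and default credentials; `human_bytes` and `estimate_bytes` are hypothetical helper names introduced only for this illustration.

```python
def human_bytes(n):
    """Format a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(n)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that fits, or at the largest unit available.
        if size < 1024 or unit == "TiB":
            return "{:.1f} {}".format(size, unit)
        size /= 1024


def estimate_bytes(sql, client=None):
    """Return the number of bytes a query would scan, using a dry run."""
    # Imported here so human_bytes works even without the library installed.
    from google.cloud import bigquery

    client = client or bigquery.Client()
    # dry_run=True validates the query and reports bytes without running it;
    # use_query_cache=False avoids a cached result hiding the real scan size.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


# Usage (needs credentials and network access):
# scanned = estimate_bytes(create_query(search_term))
# print("This query would scan about", human_bytes(scanned))
```

If the estimate is large, selecting fewer columns or querying only one of the two tables would likely reduce it, since BigQuery bills by the columns it reads. The job configuration also has a `maximum_bytes_billed` setting that makes a query fail instead of exceeding a chosen limit.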

Upvotes: 0

Views: 437

Answers (0)
