I need to collect a large number of publication numbers from Google Patents.
Examples of the numbers I need: US7863316B2, KR102121633B1.
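To make the target format concrete, here is a minimal sketch that splits such a number into country code, serial number, and kind code. The regular expression and both helper names are hypothetical and simplified (they do not cover serials containing letters). The optional dashes are there on the assumption that the BigQuery tables store numbers in a dashed form such as US-7863316-B2.

```python
import re

# Hypothetical, simplified pattern: two-letter country code, digits,
# then an optional kind code (one letter plus an optional digit).
# Dashes between the parts are optional.
_PUB_RE = re.compile(r"^([A-Z]{2})-?(\d+)-?([A-Z]\d?)?$")


def split_publication_number(pub):
    """Return (country, serial, kind) for a publication number string."""
    match = _PUB_RE.match(pub.strip().upper())
    if match is None:
        raise ValueError("Unrecognized publication number: {!r}".format(pub))
    country, serial, kind = match.groups()
    return country, serial, kind or ""


def compact_publication_number(pub):
    """Return the number without dashes, e.g. for building a list of IDs."""
    return "".join(split_publication_number(pub))
```

With these helpers, `compact_publication_number("US-7863316-B2")` gives the compact spelling used in the examples above.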
I first tried to scrape the data with the usual Python tools (such as BeautifulSoup), but that approach doesn't work with Google. I then switched to Google Cloud BigQuery and got some results. However, before I had really understood how to use the platform, I hit this error: `Quota exceeded: Your project exceeded quota for free query bytes scanned.`
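Since the error is about bytes scanned, one thing that could be checked before running a query is how much data it would read. The sketch below uses a BigQuery dry run for that. The helper names and the 1 TiB monthly budget are assumptions for illustration, not values taken from my project; `client` is expected to be a `google.cloud.bigquery.Client`.

```python
# Assumption: a free budget of 1 TiB of scanned bytes per month.
FREE_TIER_BYTES = 1024 ** 4


def estimate_scanned_bytes(client, sql):
    """Dry-run `sql` and return the bytes it would scan, without running it."""
    # Imported lazily so the pure helpers below work without the library.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


def human_bytes(n):
    """Format a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return "{:.1f} {}".format(n, unit)
        n /= 1024


def fits_budget(scanned_bytes, budget_bytes=FREE_TIER_BYTES):
    """Return True if a query of this size fits in the assumed budget."""
    return scanned_bytes <= budget_bytes
```

A dry run is not billed, so calling `human_bytes(estimate_scanned_bytes(client, sql))` on a query string first would show whether the query is likely to exhaust the quota.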
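The code I was using to get the data:
```python
from google.cloud import bigquery

# Assumption: credentials and a default project are already configured.
client = bigquery.Client()
search_term = 'epilepsy'


def create_query(search_term):
    # NOTE: the term is hardcoded in the SQL below; search_term is only
    # used in the status message.
    q = r'''
    WITH
    pubs AS (
      SELECT DISTINCT
        pub.publication_number
      FROM `patents-public-data.patents.publications` pub
      INNER JOIN `patents-public-data.google_patents_research.publications` gpr ON
        pub.publication_number = gpr.publication_number
      WHERE
        "epilepsy" IN UNNEST(gpr.top_terms)
        AND pub.grant_date < 20000101
    )
    SELECT
      publication_number, url
    FROM
      `patents-public-data.google_patents_research.publications`
    WHERE
      publication_number IN (SELECT publication_number FROM pubs)
      -- keep a random sample of roughly 1000 rows
      AND RAND() <= 1000/(SELECT COUNT(*) FROM pubs)
    '''
    return q


df = client.query(create_query(search_term)).to_dataframe()
if len(df) == 0:
    raise ValueError('No results for your search term. Retry with another term.')
else:
    print('Search complete for search term: "{}". {} random assets selected.'
          .format(search_term, len(df)))

# The query selects only publication_number and url, so embedding_v1 is
# present only if it is added to the SELECT list.
if 'embedding_v1' in df.columns:
    embedding_dict = dict(zip(df.publication_number.tolist(),
                              df.embedding_v1.tolist()))
df.head()
```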
Are there other ways to get the information I need?