Reputation: 21
I am trying to scrape data from Google Patents with Beautiful Soup and add some columns to an existing CSV file. Here is an example of a patent result. Here is my code:
import csv
import requests
from bs4 import BeautifulSoup

with open('patentdatacleaned.csv', 'r', encoding="ISO-8859-1") as csv_file:
    csv_reader = csv.reader(csv_file)
    next(csv_reader)  # skip the header row
    for line in csv_reader:
        # line[13] is expected to hold whitespace-separated patent URLs
        for row in line[13].split():
            r = requests.get(row)
            soup = BeautifulSoup(r.content, 'html.parser')
            g_data = soup.find_all("div", {"class": "description"})
            #with open('newpatentdata_class.csv', 'w', newline='', encoding="UTF-8") as write_obj:
            #    csv_writer = writer(write_obj)
            for item in g_data:
                print(item)
        break  # stop after the first data row
I managed this for the Claims, Description and Abstract, but I am not able to extract the Classification codes together with their descriptions. I tried various classes and divs and looked in detail at the child divs, but I can't find the problem. Please help.
Upvotes: 2
Views: 3142
Reputation: 195563
To get the classification codes from the Google Patents page, you can use this example:
import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/patent/EP3017304B1/en'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# select only the "Code" elements that are followed by a "Leaf" marker
for code in soup.select('[itemprop="Code"]:has(~ meta[itemprop="Leaf"])'):
    print(code.text)
    print(code.find_next('span').text)
    print('-' * 80)
Prints:
G01N33/5438
Electrodes
--------------------------------------------------------------------------------
G01N27/3275
Sensing specific biomolecules, e.g. nucleic acid strands, based on an electrode surface reaction
--------------------------------------------------------------------------------
G01N33/5308
Immunoassay; Biospecific binding assay; Materials therefor for analytes not provided for elsewhere, e.g. nucleic acids, uric acid, worms, mites
--------------------------------------------------------------------------------
G01N33/5436
Immunoassay; Biospecific binding assay; Materials therefor with an insoluble carrier for immobilising immunochemicals with ligand physically entrapped within the solid phase
--------------------------------------------------------------------------------
G01N33/544
Immunoassay; Biospecific binding assay; Materials therefor with an insoluble carrier for immobilising immunochemicals the carrier being organic
--------------------------------------------------------------------------------
G01N33/9413
Dopamine
--------------------------------------------------------------------------------
G01N33/9446
Antibacterials
--------------------------------------------------------------------------------
G01N33/946
CNS-stimulants, e.g. cocaine, amphetamines
--------------------------------------------------------------------------------
G01N2333/78
Connective tissue peptides, e.g. collagen, elastin, laminin, fibronectin, vitronectin, cold insoluble globulin [CIG]
--------------------------------------------------------------------------------
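Since the goal in the question is to add these values as CSV columns, the same selector can be wrapped in a function that returns the pairs instead of printing them. This is only a sketch: `extract_leaf_codes` and `SAMPLE_HTML` are hypothetical names, and the sample markup is an assumption about the page structure (a `Code` span, a description span, and a `Leaf` meta tag as siblings), not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Assumed, simplified markup; the real Google Patents page may differ.
SAMPLE_HTML = """
<ul>
  <li><span itemprop="Code">G</span><span itemprop="Description">PHYSICS</span>
      <meta itemprop="IsCPC" content="true"></li>
  <li><span itemprop="Code">G01N33/5438</span><span itemprop="Description">Electrodes</span>
      <meta itemprop="Leaf" content="true"></li>
</ul>
"""

def extract_leaf_codes(html):
    """Return (code, description) pairs for entries marked with a Leaf meta tag."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for code in soup.select('[itemprop="Code"]:has(~ meta[itemprop="Leaf"])'):
        desc = code.find_next("span")  # the description span follows the code span
        pairs.append((code.get_text(strip=True),
                      desc.get_text(strip=True) if desc else ""))
    return pairs

if __name__ == "__main__":
    print(extract_leaf_codes(SAMPLE_HTML))
```

For a real page you would pass `requests.get(url).content` instead of the sample string. To store the result in a single CSV cell, one option is to join the codes, for example `"; ".join(code for code, _ in pairs)`.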
EDIT: To get the status of the applications:
import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/patent/EP3017304B1/en'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# each application is a list item with country, number and legal status
for application in soup.select('li[itemprop="application"]'):
    print(application.select_one('[itemprop="countryCode"]').text)
    print(application.select_one('[itemprop="applicationNumber"]').text)
    print(application.select_one('[itemprop="legalStatus"]').text)
    print('-' * 80)
Prints:
WO
PCT/EP2014/064249
Application Filing
--------------------------------------------------------------------------------
US
US14/901,760
Active
--------------------------------------------------------------------------------
EP
EP14737196.7A
Active
--------------------------------------------------------------------------------
EP
EP17184772.6A
Withdrawn
--------------------------------------------------------------------------------
ES
ES14737196.7T
Active
--------------------------------------------------------------------------------
US
US15/702,938
Active
--------------------------------------------------------------------------------
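One caveat with the loop above: `select_one` returns `None` when an element is missing, so `.text` would raise an `AttributeError` for an application without, say, a legal status. A possible defensive variant collects each application into a dict and falls back to an empty string. The names `extract_applications` and `SAMPLE_APPS` are hypothetical, and the sample markup is an assumed simplification of the page.

```python
from bs4 import BeautifulSoup

# Assumed, simplified markup; the second entry deliberately lacks a legalStatus.
SAMPLE_APPS = """
<ul>
  <li itemprop="application">
    <span itemprop="countryCode">US</span>
    <span itemprop="applicationNumber">US14/901,760</span>
    <span itemprop="legalStatus">Active</span>
  </li>
  <li itemprop="application">
    <span itemprop="countryCode">EP</span>
    <span itemprop="applicationNumber">EP17184772.6A</span>
  </li>
</ul>
"""

FIELDS = ("countryCode", "applicationNumber", "legalStatus")

def extract_applications(html):
    """Return one dict per application; missing fields become empty strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for application in soup.select('li[itemprop="application"]'):
        row = {}
        for field in FIELDS:
            node = application.select_one(f'[itemprop="{field}"]')
            row[field] = node.get_text(strip=True) if node else ""
        rows.append(row)
    return rows

if __name__ == "__main__":
    for row in extract_applications(SAMPLE_APPS):
        print(row)
```

A list of dicts like this can be passed directly to `csv.DictWriter.writerows` with `fieldnames=FIELDS` if the statuses should end up in a CSV file.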
Upvotes: 2