eliza01

Reputation: 21

Google patents scraping with Beautiful Soup

I am trying to scrape data from Google Patents with Beautiful Soup and add some columns to an existing CSV file. Here is an example of a patent result. Here is my code:

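import csv

import requests
from bs4 import BeautifulSoup

with open('patentdatacleaned.csv', 'r', encoding="ISO-8859-1") as csv_file:
    csv_reader = csv.reader(csv_file)
    next(csv_reader)  # skip the header row
    for line in csv_reader:
        # line[13] holds one or more whitespace-separated patent URLs
        for row in line[13].split():
            r = requests.get(row)
            # parse each response inside the loop so no URL is skipped
            soup = BeautifulSoup(r.content, 'html.parser')
            g_data = soup.find_all("div", {"class": "description"})
            #with open('newpatentdata_class.csv', 'w', newline='', encoding="UTF-8") as write_obj:
            #    csv_writer = writer(write_obj)
            for item in g_data:
                print(item)
        break  # stop after the first data row while testing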

I managed this for the Claims, Description and Abstract, but I am not able to extract the classification codes together with their descriptions. I tried various classes and divs and looked in detail at the child divs, but I can't find the problem. Please help.

Upvotes: 2

Views: 3142

Answers (1)

Andrej Kesely

Reputation: 195563

To get the classification codes from a Google Patents page, you can use this example:

import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/patent/EP3017304B1/en'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for code in soup.select('[itemprop="Code"]:has(~ meta[itemprop="Leaf"])'):
    print(code.text)
    print(code.find_next('span').text)
    print('-' * 80)

Prints:

G01N33/5438
Electrodes
--------------------------------------------------------------------------------
G01N27/3275
Sensing specific biomolecules, e.g. nucleic acid strands, based on an electrode surface reaction
--------------------------------------------------------------------------------
G01N33/5308
Immunoassay; Biospecific binding assay; Materials therefor for analytes not provided for elsewhere, e.g. nucleic acids, uric acid, worms, mites
--------------------------------------------------------------------------------
G01N33/5436
Immunoassay; Biospecific binding assay; Materials therefor with an insoluble carrier for immobilising immunochemicals with ligand physically entrapped within the solid phase
--------------------------------------------------------------------------------
G01N33/544
Immunoassay; Biospecific binding assay; Materials therefor with an insoluble carrier for immobilising immunochemicals the carrier being organic
--------------------------------------------------------------------------------
G01N33/9413
Dopamine
--------------------------------------------------------------------------------
G01N33/9446
Antibacterials
--------------------------------------------------------------------------------
G01N33/946
CNS-stimulants, e.g. cocaine, amphetamines
--------------------------------------------------------------------------------
G01N2333/78
Connective tissue peptides, e.g. collagen, elastin, laminin, fibronectin, vitronectin, cold insoluble globulin [CIG]
--------------------------------------------------------------------------------
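Since the goal in the question is to add columns to a CSV, it may help to wrap the selector above in a function that returns data instead of printing it. The following is only a minimal sketch: `extract_classifications`, `as_csv_cells` and `SAMPLE_HTML` are hypothetical names, and the sample fragment is hand-written to imitate what the markup is assumed to look like (a `Code` span followed by a description span and a `Leaf` meta tag). The real page may differ or change.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that only imitates the assumed page markup;
# the first entry is a made-up non-leaf placeholder.
SAMPLE_HTML = """
<ul>
  <li itemprop="cpcs">
    <span itemprop="Code">G01N33/00</span>
    <span itemprop="Description">Non-leaf placeholder</span>
  </li>
  <li itemprop="cpcs">
    <span itemprop="Code">G01N33/5438</span>
    <span itemprop="Description">Electrodes</span>
    <meta itemprop="Leaf" content="true">
  </li>
</ul>
"""


def extract_classifications(html):
    """Return (code, description) pairs for the leaf classification entries."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # Same selector as above: a Code element with a later Leaf meta sibling.
    for code in soup.select('[itemprop="Code"]:has(~ meta[itemprop="Leaf"])'):
        description = code.find_next("span")
        pairs.append((
            code.get_text(strip=True),
            description.get_text(strip=True) if description else "",
        ))
    return pairs


def as_csv_cells(pairs, sep="; "):
    """Join codes and descriptions into two cells to append to a CSV row."""
    codes = sep.join(code for code, _ in pairs)
    descriptions = sep.join(text for _, text in pairs)
    return [codes, descriptions]


pairs = extract_classifications(SAMPLE_HTML)
print(pairs)
print(as_csv_cells(pairs))
```

In the question's loop, the two cells could be appended to `line` before writing the row with `csv.writer`. Joining with `"; "` is just one possible choice, since a patent usually has several codes.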

EDIT: To get the status of the applications:

import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/patent/EP3017304B1/en'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for application in soup.select('li[itemprop="application"]'):
    print(application.select_one('[itemprop="countryCode"]').text)
    print(application.select_one('[itemprop="applicationNumber"]').text)
    print(application.select_one('[itemprop="legalStatus"]').text)
    print('-' * 80)

Prints:

WO
PCT/EP2014/064249
Application Filing
--------------------------------------------------------------------------------
US
US14/901,760
Active
--------------------------------------------------------------------------------
EP
EP14737196.7A
Active
--------------------------------------------------------------------------------
EP
EP17184772.6A
Withdrawn
--------------------------------------------------------------------------------
ES
ES14737196.7T
Active
--------------------------------------------------------------------------------
US
US15/702,938
Active
--------------------------------------------------------------------------------
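If an application entry lacks one of these fields, `select_one` returns `None` and `.text` raises an `AttributeError`. A hedged variant that tolerates missing fields and produces CSV text could look like the sketch below. `extract_applications`, `to_csv_text`, `FIELDS` and the sample fragment are hypothetical and hand-written for illustration. Only the `itemprop` names are taken from the code above.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hand-written fragment imitating the assumed markup; the second entry
# deliberately lacks a legalStatus element.
SAMPLE_HTML = """
<ul>
  <li itemprop="application">
    <span itemprop="countryCode">WO</span>
    <span itemprop="applicationNumber">PCT/EP2014/064249</span>
    <span itemprop="legalStatus">Application Filing</span>
  </li>
  <li itemprop="application">
    <span itemprop="countryCode">US</span>
    <span itemprop="applicationNumber">US14/901,760</span>
  </li>
</ul>
"""

FIELDS = ("countryCode", "applicationNumber", "legalStatus")


def extract_applications(html):
    """Return one dict per application; missing fields become empty strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for application in soup.select('li[itemprop="application"]'):
        row = {}
        for field in FIELDS:
            node = application.select_one(f'[itemprop="{field}"]')
            row[field] = node.get_text(strip=True) if node else ""
        rows.append(row)
    return rows


def to_csv_text(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_applications(SAMPLE_HTML)
print(to_csv_text(rows))
```

The `csv` module quotes the application number `US14/901,760` automatically because it contains a comma, so writing the rows by hand with `','.join(...)` would corrupt that column.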

Upvotes: 2
