astronaut

Reputation: 77

Extract specific portions of an HTML file using Python

How can I extract a specific portion of an HTML file, for example https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry

So far I have used BeautifulSoup to get the text of the HTML without the tags, but I would like my code to read only, say, the claims section of the file mentioned above.

Upvotes: 0

Views: 2153

Answers (3)

Roni Antonio

Reputation: 1450

Here you go. On this site the claims section is its own HTML section, marked with the attribute itemprop="claims", which makes things easier. I just collected that section and pulled it out into a variable so you can play with it.

import requests
from bs4 import BeautifulSoup

# Download the patent page and parse it
page = requests.get("https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry")
soup = BeautifulSoup(page.content, 'html.parser')

# The claims sit in a <section itemprop="claims"> element
claim_sect = soup.find_all('section', attrs={"itemprop": "claims"})
print('This is the raw content:\n')
print(str(claim_sect))
print('This is the variable type:\n')
print(str(type(claim_sect)))

# find_all returns a list-like ResultSet; take the first match and drop the tags
str_sect = claim_sect[0].get_text()
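
The same idea can be tried offline as a minimal sketch. The SAMPLE_HTML string and the extract_claims_text helper below are made up for illustration, and the inner claim/claim-text divs are only an assumption about the markup. The one thing taken from the answer above is the section with itemprop="claims".

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup is more complex.
SAMPLE_HTML = """
<html><body>
  <section itemprop="description"><p>Background text.</p></section>
  <section itemprop="claims">
    <div class="claim"><div class="claim-text">1. A compound of formula I.</div></div>
    <div class="claim"><div class="claim-text">2. The compound of claim 1.</div></div>
  </section>
</body></html>
"""


def extract_claims_text(html):
    """Return the text of the first <section itemprop="claims">, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("section", attrs={"itemprop": "claims"})
    if section is None:
        return None
    # strip=True drops whitespace-only strings; separator puts each piece on its own line
    return section.get_text(separator="\n", strip=True)


claims_text = extract_claims_text(SAMPLE_HTML)
print(claims_text)
```

Checking for None before calling get_text avoids an error on pages that have no claims section, or when the request returned something other than the patent page.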

Upvotes: 2

astronaut

Reputation: 77

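from bs4 import BeautifulSoup

filename = 'C:/Users/xyz/.ipynb_checkpoints/EP1208209A1.html'
# Read the saved page; the with-block closes the file automatically
with open(filename, 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
#print(source_code)
soup = BeautifulSoup(source_code, 'html.parser')
# Prints the text of the whole page, not just the claims
print(soup.get_text())
#mydivs = soup.findAll("div", {"class": "flex flex-width style-scope patent-result"})
#div_with_claims = mydivs [1]
#print(div_with_claims)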

Upvotes: 0

Roni Antonio

Reputation: 1450

As far as I can see, there are two divs with class="flex flex-width style-scope patent-result", and the second one holds the claims.

# sdata is the page's HTML source as a string
soup = BeautifulSoup(sdata, 'html.parser')
mydivs = soup.find_all("div", {"class": "flex flex-width style-scope patent-result"})
div_with_claims = mydivs[1]  # second matching div
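
A hedged variant of the same lookup uses a CSS selector, which matches the four classes regardless of their order in the attribute. The SAMPLE_HTML string and the second_matching_div helper are hypothetical and only mimic the two-div layout described above; the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs sharing the same class list, as described above.
SAMPLE_HTML = """
<div class="flex flex-width style-scope patent-result">Description text</div>
<div class="flex flex-width style-scope patent-result">Claims text</div>
"""


def second_matching_div(html, selector="div.flex.flex-width.style-scope.patent-result"):
    """Return the text of the second element matching the selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select(selector)
    # Guard against pages with fewer than two matches instead of raising IndexError
    if len(matches) < 2:
        return None
    return matches[1].get_text(strip=True)


print(second_matching_div(SAMPLE_HTML))
```

Picking an element by position is fragile: if the page layout changes, index 1 may point at a different block. Selecting the section by its itemprop attribute is likely more robust.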

Upvotes: 0
