BeautifulSoup doesn't read html

Question

I managed to get requests to work when calling a URL with specific headers, and the page's HTML prints when I print r.content.

import requests

url = 'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=9854726.PN.&OS=PN/9854726&RS=PN/9854726'
HEADERS = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0"}

r = requests.get(url, headers=HEADERS)
r = r.content

The output is as expected (shortened here, since I don't want to paste the entire HTML):




United States Patent: 9854726
...

However, when I pass it into BeautifulSoup

from bs4 import BeautifulSoup
soup = BeautifulSoup(r)
print(soup.prettify())

it only prints out:


 
  
  
   United States Patent: 9854726

It doesn't print out the full HTML. Are there any quick fixes for this? I have tried encoding the response as UTF-8, but that hasn't worked. I have also tried using r.text instead of r.content, to no avail.

I know the USPTO site is old, so if there aren't any easy solutions then I'm going to try to parse it with regex.

Edit: I just figured it out. The problem was that the BeautifulSoup output wasn't being formatted properly. I used a regex to delete the offending part, joined the rest back with the original HTML, and it worked! Thanks for the help.
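
For anyone wanting to try the regex cleanup mentioned in the edit, here is a minimal sketch. The exact malformed sequence on the USPTO page is not shown above, so this assumes the broken block starts with `<!` without the two hyphens of a normal comment and ends at the next `>`. The pattern, the function name strip_bogus_declarations, and the sample string are all hypothetical.

```python
import re

# Assumption: the malformed block starts with "<!" but is neither a real
# comment ("<!--") nor a doctype, and runs up to the next ">".
BOGUS_DECL = re.compile(r"<!(?!--)(?!doctype)[^>]*>", re.IGNORECASE)


def strip_bogus_declarations(html):
    """Remove malformed '<!...>' blocks, keeping real comments and the doctype."""
    return BOGUS_DECL.sub("", html)


# Hypothetical sample, not the real USPTO markup.
sample = (
    "<!DOCTYPE html><html><head><title>T</title></head>"
    "<!-BUF1=1\nBUF7=2->\n"
    "<body><p>text</p><!-- ok --></body></html>"
)

cleaned = strip_bogus_declarations(sample)
print(cleaned)
```

The cleaned string can then be passed to BeautifulSoup as usual. If the real page uses a different malformed opener, the pattern would need adjusting.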

DYZ · Accepted Answer

The HTML file is malformed (intentionally or unintentionally). It uses a nonstandard sequence to start a comment, instead of the standard <!--.

You can follow up with the answers to another question to find out how to extract comments, e.g.:



import bs4
soup.findAll(text=lambda text: isinstance(text, bs4.Comment))
#['BUF1=9854726
BUF7=2018
BUF8=48007
BUF9=/1/
BUF51=9']
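
To check independently what a parser reports as comments, a stdlib-only sketch using Python 3's html.parser (not BeautifulSoup) might look like this. The CommentCollector class, the extract_comments helper, and the sample markup are hypothetical. The sample uses a well-formed comment, so it only illustrates the collection step, not the malformed USPTO case.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every comment the parser reports."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called with the text between "<!--" and "-->".
        self.comments.append(data)


def extract_comments(html):
    """Return the comment strings found in an HTML document."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return collector.comments


# Hypothetical sample with a well-formed comment.
sample = "<html><body><!--BUF1=1\nBUF7=2--><p>text</p></body></html>"
comments = extract_comments(sample)
print(comments)
```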
