Tim Smaluk

Reputation: 35

BeautifulSoup doesn't read html

I managed to get requests to work when calling a URL with specific headers, and the page's HTML prints when I output r.content.

import requests

url = 'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=9854726.PN.&OS=PN/9854726&RS=PN/9854726'
HEADERS = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0"}

r = requests.get(url, headers=HEADERS)
r = r.content

The output is as expected (shortened here, since I don't want to paste the entire HTML):

<HTML>
<HEAD>
<BASE target="_top">
<TITLE>United States Patent: 9854726</TITLE></HEAD>
<!-BUF1=9854726
BUF7=2018
BUF8=48007
BUF9=/1/
BUF51=9
-->...

However, when I pass it into BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(r)
print(soup.prettify())

it only prints out:

<html>
 <head>
  <base target="_top" />
  <title>
   United States Patent: 9854726
  </title>
 </head>
</html>

It doesn't print the full HTML. Is there a quick fix for this? I have tried encoding the response as UTF-8, but that didn't work. I have also tried using r.text instead of r.content, to no avail.

I know the USPTO site is old, so if there isn't an easy solution then I'm going to try to parse it with regex.

Edit: I just figured it out. The problem was that the output of BeautifulSoup wasn't being formatted properly. I used regex to delete it and join it back with the original HTML, and it worked! Thanks for the help.

Upvotes: 1

Views: 790

Answers (1)

DYZ

Reputation: 57105

The HTML file is malformed (intentionally or unintentionally). It uses "<!-" to start a comment instead of "<!--", and BeautifulSoup fails to recognize that comment. As a quick fix, replace the incorrect comment opener with a correct one:

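# r must be a text string here (e.g. r.text on Python 3);
# calling bytes.replace() with str arguments raises a TypeError
soup = BeautifulSoup(r.replace("<!-", "<!--"))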
print(soup.prettify())
#<html>
# <head>
#  <base target="_top"/>
#  <title>
#   United States Patent: 9854726
#  </title>
# </head>
# <!--BUF1=9854726
#BUF7=2018
#BUF8=48007
#BUF9=/1/
#BUF51=9-->
#</html>

You can follow up with the answers to another question to find out how to extract comments, e.g.:

import bs4

soup.findAll(text=lambda text: isinstance(text, bs4.Comment))
#['BUF1=9854726\nBUF7=2018\nBUF8=48007\nBUF9=/1/\nBUF51=9']
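
One caveat with the plain replace call above: if a page also contains well-formed "<!--" comments, replacing every "<!-" turns those into "<!---". A minimal sketch of a narrower repair, assuming the markup is already a text string, could use a regular expression with a negative lookahead. The helper name fix_comment_openers is hypothetical and is not part of BeautifulSoup or requests:

```python
import re

# Hypothetical helper, not a BeautifulSoup API: matches "<!-" only when it is
# NOT followed by another "-", so well-formed "<!--" openers are left alone.
_BROKEN_COMMENT_OPENER = re.compile(r"<!-(?!-)")


def fix_comment_openers(html):
    """Return html with malformed '<!-' comment openers rewritten as '<!--'."""
    return _BROKEN_COMMENT_OPENER.sub("<!--", html)


# Shortened sample modelled on the page in the question, plus one
# well-formed comment to show that it is not touched.
sample = (
    "<HTML>\n"
    "<HEAD><TITLE>United States Patent: 9854726</TITLE></HEAD>\n"
    "<!-BUF1=9854726\n"
    "BUF7=2018\n"
    "-->\n"
    "<!-- already fine -->\n"
    "</HTML>"
)

fixed = fix_comment_openers(sample)
print(fixed)
```

The repaired string can then be passed to BeautifulSoup in place of r.replace("<!-", "<!--"). On Python 3 this expects text such as r.text rather than the bytes in r.content.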

Upvotes: 1
