Reputation: 35
I managed to get requests to work when calling a URL with specific headers, and the page's HTML prints when I call r.content.
import requests

url = 'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=9854726.PN.&OS=PN/9854726&RS=PN/9854726'
HEADERS = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0"}
r = requests.get(url, headers=HEADERS)
r = r.content
The output is as expected (this is a shortened version, since I don't want to spam the entire HTML):
<HTML>
<HEAD>
<BASE target="_top">
<TITLE>United States Patent: 9854726</TITLE></HEAD>
<!-BUF1=9854726
BUF7=2018
BUF8=48007
BUF9=/1/
BUF51=9
-->...
However, when I pass it into BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup(r)
print soup.prettify()
it only prints out:
<html>
<head>
<base target="_top" />
<title>
United States Patent: 9854726
</title>
</head>
</html>
It doesn't print out the full HTML. I was wondering if there are any quick fixes for this? I have tried encoding the request as UTF-8, but that hasn't worked. I have also tried using r.text instead of r.content, but to no avail.
I know the USPTO site is old, so if there aren't any easy solutions, then I'm going to try to parse it with regex.
Edit: I just figured it out. The problem was that the malformed part of the HTML wasn't being handled properly by BeautifulSoup. I used regex to fix it and join it back with the original HTML, and it worked! Thanks for the help.
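A minimal sketch of one such regex fix, assuming the only problem is the non-standard "<!-" comment opener (the pattern below is illustrative, and it reuses the url and HEADERS defined above):
import re
import requests
from bs4 import BeautifulSoup

# Illustrative regex repair: turn the non-standard "<!-" opener into "<!--"
# without touching comment openers that are already well-formed.
html = requests.get(url, headers=HEADERS).text
repaired = re.sub(r"<!-(?!-)", "<!--", html)
soup = BeautifulSoup(repaired, "html.parser")
print(soup.prettify())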
Upvotes: 1
Views: 790
Reputation: 57105
The HTML file is malformed (intentionally or unintentionally). It uses "<!-" to start a comment instead of "<!--", and BeautifulSoup fails to recognize that comment. As a quick fix, replace the incorrect comment opener with a correct one:
from bs4 import BeautifulSoup

# Repair the malformed comment opener before parsing
soup = BeautifulSoup(r.replace("<!-", "<!--"), "html.parser")
print(soup.prettify())
#<html>
# <head>
# <base target="_top"/>
# <title>
# United States Patent: 9854726
# </title>
# </head>
# <!--BUF1=9854726
#BUF7=2018
#BUF8=48007
#BUF9=/1/
#BUF51=9-->
#</html>
You can follow up with the answers to another question to find out how to extract comments, e.g.:
import bs4  # needed for bs4.Comment

soup.findAll(text=lambda text: isinstance(text, bs4.Comment))
#['BUF1=9854726\nBUF7=2018\nBUF8=48007\nBUF9=/1/\nBUF51=9']
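If you then want the individual BUF values, one possible follow-up (just a sketch, assuming the comment always contains newline-separated key=value pairs as above) is to split the extracted comment into a dict:
import bs4

comment = soup.find(text=lambda text: isinstance(text, bs4.Comment))
# Split 'BUF1=9854726\nBUF7=2018\n...' into {'BUF1': '9854726', ...}
fields = dict(line.split("=", 1) for line in comment.strip().splitlines())
print(fields["BUF1"])
#9854726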
Upvotes: 1