Reputation: 63
I have a URL that points to a .txt file which contains HTML code. This is a sample link:
http://www.sec.gov/Archives/edgar/data/70858/000119312507058027/0001193125-07-058027.txt
I want to parse this HTML with BeautifulSoup, using code like this:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.sec.gov/Archives/edgar/data/70858/000119312507058027/0001193125-07-058027.txt"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
print (soup.prettify())
However, I got a lot of errors like:
File "C:/Users/.../aa.py", line 7, in <module> print (soup.prettify())
File "build\bdist.win32\egg\bs4\element.py", line 1097, in prettify
return self.decode(True, formatter=formatter)
I suspect this happens because the URL points to a .txt file rather than an .html file. Am I right? If so, can someone tell me what the solution is?
Upvotes: 1
Views: 2088
Reputation: 521
You could try feeding only the HTML section of the text file (starting from the <html> tag) into Beautiful Soup. I imagine it's breaking because the start of the text file doesn't contain any HTML.
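Something like the following might work as a rough sketch of that idea. The helper names (extract_html_section, parse_html_section) and the sample string are made up for illustration, and it assumes the file contains a single <html>...</html> block. I haven't checked it against the actual file in the question:

```python
import re

# Matches the first <html ...> ... </html> span, case-insensitively,
# with "." allowed to cross line breaks.
HTML_SECTION = re.compile(r"<html\b.*?</html\s*>", re.IGNORECASE | re.DOTALL)


def extract_html_section(text):
    """Return the first <html>...</html> span in text, or None if absent."""
    match = HTML_SECTION.search(text)
    return match.group(0) if match else None


def parse_html_section(text):
    """Parse only the HTML section of text with Beautiful Soup."""
    # bs4 is third-party, so it is imported only when this function is called.
    from bs4 import BeautifulSoup

    html = extract_html_section(text)
    if html is None:
        raise ValueError("no <html> section found in the text")
    # Naming the parser explicitly keeps the result the same across machines.
    return BeautifulSoup(html, "html.parser")


# Made-up sample: some non-HTML header lines followed by an HTML block.
sample = (
    "<SEC-DOCUMENT>header lines that are not HTML\n"
    "<TEXT>\n"
    "<HTML><BODY><P>Hello</P></BODY></HTML>\n"
    "</TEXT>\n"
)
```

You would fetch the page the same way as in the question, pass the result of page.read() to parse_html_section, and call prettify() on what it returns. If the file turns out to hold several HTML blocks, re.finditer with the same pattern would let you loop over all of them.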
Upvotes: 1