Reputation: 741
I am trying to parse an HTML page using BeautifulSoup. After parsing, the output HTML file looks distorted when rendered. The strange thing is that it contains exactly the same HTML as the source file (only parsed with BeautifulSoup). This is the code snippet I am using:
from bs4 import BeautifulSoup

output_pages = []
soup = BeautifulSoup(open(html_page, "r"), "lxml")
output_pages.append(soup.prettify())
with open(output_file, "w+") as f:
    for html_page in output_pages:
        f.write(html_page)
I tried a few variants with different arguments, but none of them worked. Am I doing something wrong here, or is there a better way to parse HTML in Python?
Upvotes: 1
Views: 59
Reputation: 2971
Yes, you should avoid using "soup.prettify()" here. It adds a lot of line breaks (basically around every tag), and a browser renders those as extra spaces in places you don't want them (for example, between a word and the punctuation that follows it).
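To see the effect, here is a minimal sketch that serializes the same markup both ways. The sample string and the helper name `serialize` are made up for illustration, and it uses the built-in "html.parser" instead of "lxml" so it runs without extra dependencies (with "lxml" a fragment like this would also get wrapped in html/body tags).

```python
from bs4 import BeautifulSoup


def serialize(html, pretty=False):
    """Parse html and return it as a string, optionally prettified."""
    # "html.parser" is an assumption made for this sketch; the question uses "lxml".
    soup = BeautifulSoup(html, "html.parser")
    # str(soup) serializes the tree without inserting extra line breaks.
    return soup.prettify() if pretty else str(soup)


sample = "<p>Hello <b>world</b>, goodbye.</p>"  # hypothetical sample markup
compact = serialize(sample)
pretty = serialize(sample, pretty=True)

print(repr(compact))  # markup without added whitespace
print(repr(pretty))   # tags and text split across lines
```

In the question's code that would mean appending str(soup) to output_pages instead of soup.prettify().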
"soup.prettify()" is not really meant for the HTML you save. It is meant for printing the tree so that it is easier to read while debugging.
Upvotes: 1