Reputation: 749
I want to parse part of an HTML page with BeautifulSoup.
Here is my code:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
body = """Some text
<body{block:PermalinkPage} class="inside"{/block:PermalinkPage}>
Some text
"""
print(BeautifulSoup(body, 'html5lib'))
Output is
<html><head></head><body>Some text
<body{block:permalinkpage} block:permalinkpage}="" class="inside" {="">
Some text
</body{block:permalinkpage}></body></html>
The desired output is
<html><head></head><body>Some text
<body{block:PermalinkPage} class="inside"{/block:PermalinkPage}>
Some text
</body{block:permalinkpage}></body></html>
Why does BeautifulSoup change this markup so much? Is it possible to force it to keep the markup as written? If not, what library should I use to get the desired output?
Upvotes: 0
Views: 192
Reputation: 88128
This doesn't look like valid HTML (though I could be wrong). Underneath, BeautifulSoup uses a parser, which in this case you've explicitly set to html5lib. If the underlying parser can't handle your input, bs4 won't either.
It looks like you are feeding it a template language that gets processed into HTML (for example, Mustache or Slim), but it's hard to say without more context.
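If the input really is a template, one possible workaround is to remove the template markers before handing the text to an HTML parser. The sketch below is only an illustration under assumptions: it guesses that the markers always have the form `{block:Name}` / `{/block:Name}`, and the names `TEMPLATE_TAG`, `strip_template_tags`, and `TagCollector` are made up for this example. It uses only the standard library, and it does not produce the exact output asked for in the question, because the markers are dropped rather than preserved.

```python
import re
from html.parser import HTMLParser

# Assumption: template markers look like {block:Name} or {/block:Name}.
TEMPLATE_TAG = re.compile(r"\{/?block:[^{}]*\}")


def strip_template_tags(source):
    """Remove the block markers but keep the text they wrap."""
    return TEMPLATE_TAG.sub("", source)


class TagCollector(HTMLParser):
    """Record every start tag with its attributes, to inspect the result."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


body = """Some text
<body{block:PermalinkPage} class="inside"{/block:PermalinkPage}>
Some text
"""

cleaned = strip_template_tags(body)

collector = TagCollector()
collector.feed(cleaned)
collector.close()

print(cleaned)
print(collector.tags)
```

The cleaned string can then be passed to BeautifulSoup as usual. Note that this treats every block as if it were always rendered, so it is only suitable when the template logic itself does not need to survive parsing.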
Upvotes: 1