SKulibin

Reputation: 749

BeautifulSoup: unexpected result when parsing part of a page (Tumblr template)

I want to parse part of an HTML page with BeautifulSoup.

Here is my code:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

body = """Some text
<body{block:PermalinkPage} class="inside"{/block:PermalinkPage}>
Some text
"""

print(BeautifulSoup(body, 'html5lib'))

Output is

<html><head></head><body>Some text
<body{block:permalinkpage} block:permalinkpage}="" class="inside" {="">
Some text
</body{block:permalinkpage}></body></html>

The desired output is

<html><head></head><body>Some text
<body{block:PermalinkPage} class="inside"{/block:PermalinkPage}>
Some text
</body{block:permalinkpage}></body></html>

Why does BeautifulSoup change this markup so much? Is it possible to force it to work the way I expect? If not, what library should I use to get the desired output?

Upvotes: 0

Views: 192

Answers (1)

Hooked

Reputation: 88128

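This doesn't look like valid HTML (though I could be wrong). Under the hood, BeautifulSoup delegates to a parser, which in this case you've explicitly set to html5lib. That parser tries to recover from malformed markup the way a browser would, which is why the tag name comes out lowercased and the stray template tokens end up as empty attributes. If the underlying parser can't handle your input, bs4 can't either.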

It looks like you are feeding it a template language that is meant to be processed into HTML (like Mustache or Slim, for example), but it's hard to say without more context.
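If that guess is right, one option is to resolve the template tokens yourself before the markup ever reaches an HTML parser. Below is a minimal sketch using only the standard library. It assumes the only template syntax present is paired `{block:Name}...{/block:Name}` markers, as in your sample; `render_blocks` and `BLOCK_RE` are hypothetical names I made up for illustration, not part of bs4 or of any Tumblr tooling, and real templates may contain other constructs this does not cover.

```python
import re

# Hypothetical pattern: a {block:Name} ... {/block:Name} pair with the same name.
BLOCK_RE = re.compile(r"\{block:(\w+)\}(.*?)\{/block:\1\}", re.DOTALL)


def render_blocks(template, enabled=()):
    """Keep the contents of enabled blocks and drop all other blocks."""
    enabled = set(enabled)

    def replace(match):
        name, inner = match.group(1), match.group(2)
        return inner if name in enabled else ""

    # Repeat until nothing changes, so blocks nested inside a kept block
    # are resolved on a later pass. Same-name nesting is not handled.
    previous = None
    while previous != template:
        previous = template
        template = BLOCK_RE.sub(replace, template)
    return template


body = """Some text
<body{block:PermalinkPage} class="inside"{/block:PermalinkPage}>
Some text
"""

permalink_html = render_blocks(body, enabled={"PermalinkPage"})
index_html = render_blocks(body)

print(permalink_html)
print(index_html)
```

With `PermalinkPage` enabled, the tag becomes `<body class="inside">`; with it disabled, it becomes a plain `<body>`. Either result is ordinary HTML that you can then pass to BeautifulSoup. Note that this does not give you the exact "desired output" from the question, since that output keeps the template tokens inside the tag, which an HTML parser has no way to represent.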

Upvotes: 1
