Sarim

Reputation: 3229

CSS-aware intelligent HTML parser for Python

I'm looking for an HTML parser that is CSS-aware and works the same way a browser renders HTML. What I'm really after is an equivalent of element.innerText (from the DOM in JavaScript). Let me give an example. Consider the following HTML:

<style>
.AAA { display:inline;}
.BBB { display:none;}
.CCC { display:inline ;}
</style>
<span id="sarim">

    <span class="AAA">a</span>
    <span style="display:none">b</span>
    c
    <span class="CCC">d</span>
    <div style="display:inline">e</div>
    <span class="BBB">f</span>

</span>

Now if I load the above HTML in a browser and run document.getElementById('sarim').innerText, it returns "a c d e". That's exactly what I need. But if I use an HTML parser and simply strip the tags, it returns "abcdef". I need a parser that will automatically ignore "b" and "f" by reading their CSS display property.

Any idea which parser supports this? I tried Beautiful Soup:

hiddenelements = sarim.findAll(True, {'style' : 'display:none'})
for p in hiddenelements:
    p.extract()

Now sarim.text returns the visible text, but this only works for inline styles, and it is a manual process that fails for styles applied through CSS classes. Since the class names will be random, I'm looking for an intelligent parser that does this automatically.
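To make the class-based case concrete, here is a minimal sketch using only the Python standard library. It is not a real CSS engine: it assumes the stylesheet contains only plain `.classname { ... }` rules, it ignores the cascade, specificity and every other selector type, and it assumes hidden elements are properly closed. The names `hidden_classes`, `VisibleTextExtractor` and `visible_text` are hypothetical helpers made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

DISPLAY_NONE = re.compile(r"display\s*:\s*none", re.I)


def hidden_classes(css):
    """Return class names whose rule sets display:none.

    Assumption: only simple `.name { ... }` rules are recognised.
    """
    hidden = set()
    for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css):
        if DISPLAY_NONE.search(body):
            hidden.add(name)
    return hidden


class VisibleTextExtractor(HTMLParser):
    """Collect text, skipping subtrees hidden by inline style or class."""

    def __init__(self, hidden):
        super().__init__()
        self.hidden = hidden
        self.skip_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside a hidden subtree: just track nesting.
            self.skip_depth += 1
            return
        attrs = dict(attrs)
        style = attrs.get("style") or ""
        classes = (attrs.get("class") or "").split()
        if (tag in ("style", "script")
                or DISPLAY_NONE.search(style)
                or self.hidden.intersection(classes)):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return whitespace-normalised text of elements not set to display:none."""
    css = " ".join(re.findall(r"<style[^>]*>(.*?)</style>", html, re.S | re.I))
    parser = VisibleTextExtractor(hidden_classes(css))
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Calling visible_text() on the HTML from the question returns "a c d e". Anything beyond these simple rules (descendant selectors, external stylesheets, styles changed by scripts) still needs a real rendering engine.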

As a fail-safe, I could run a headless WebKit (phantomjs.org) and use element.innerText to retrieve the visible text. Any better ideas?

Upvotes: 2

Views: 751

Answers (2)

Jamie Mason

Reputation: 4211

I've made a CSS-aware HTML minifier using PhantomJS at https://github.com/JamieMason/Asterisk. It would be easy to fork and modify it for your purpose.

The main work is done in https://github.com/JamieMason/Asterisk/blob/master/src/browser.js. For my use case I inspect the styles to generate HTML output, but you could return the innerText instead.

Upvotes: 0

xiaowl

Reputation: 5217

How about Python-Webkit? It's a Python binding for WebKit.

The Python Webkit DOM Project makes python a full peer of javascript when it comes to accessing and manipulating the full features available to Webkit, such as HTML5. Everything that can be done with javascript, such as getElementsByTagName and appendChild, event callbacks through onclick, timeout callbacks through window.setTimeout, and even AJAX using XMLHttpRequest, can also be done from python.

Upvotes: 1
