Reputation: 3324
I want to parse HTML in Python and have already tried Beautiful Soup and pyquery. The problem is that those parsers modify the original code, e.g. by inserting tags. Is there any parser that does not change the code?
I tried HTMLParser, but with no success. :(
It doesn't modify the code and just tells me where the tags are placed, but it fails to parse web pages like mail.live.com.
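For reference, here is a minimal sketch of the kind of thing HTMLParser does: it reports where each tag sits and leaves the source string untouched. The class name `TagLocator` and the sample markup are made up for illustration. Only `getpos()` and `get_starttag_text()` come from the standard library.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Record where each tag sits in the source without rewriting it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Each entry: (kind, lowercased tag name, line, column, raw start-tag text)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        line, col = self.getpos()  # line is 1-based, column is 0-based
        self.tags.append(("start", tag, line, col, self.get_starttag_text()))

    def handle_endtag(self, tag):
        line, col = self.getpos()
        self.tags.append(("end", tag, line, col, None))


# Hypothetical sample input; the string itself is never modified.
source = "<div>\n  <P class=x>hi</p>\n</div>"
locator = TagLocator()
locator.feed(source)
locator.close()

for entry in locator.tags:
    print(entry)
```

The tag name is reported in lowercase, but `get_starttag_text()` returns the start tag exactly as written, so the original spelling can still be recovered from the recorded position.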
Any idea how to parse a web page just like a browser?
Upvotes: 1
Views: 296
Reputation: 16625
Have you tried the WebKit engine with Python bindings?
See this: https://github.com/niwibe/phantompy
You can traverse the real DOM of the parsed web page and do what you need to do.
Upvotes: 0
Reputation: 3324
No, as of now there is no such HTML parser, and every parser has its own limitations.
Upvotes: 0
Reputation:
You can use BeautifulSoup to extract just the text without modifying the tags. It's in their documentation.
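A minimal sketch of that, assuming the `bs4` package is installed and using the built-in `html.parser` backend. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input
html = "<p>Hello <b>world</b></p>"

# Parse with the standard-library backend, then pull out only the text nodes.
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()

print(text)
```

Note that this only reads text out of the parsed tree. The original `html` string is left as it was, but the tree BeautifulSoup builds may still differ from the source for malformed markup.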
Same question here: How to extract text from beautiful soup
Upvotes: 1