Mehraban

Reputation: 3324

Python HTML parser that doesn't modify the actual markup?

I want to parse HTML in Python and have already tried Beautiful Soup and pyquery. The problem is that those parsers modify the original markup, e.g. by inserting tags. Is there any parser that does not change the code?


I tried HTMLParser, but with no success. It doesn't modify the code and just tells me where tags are placed, but it fails on web pages like mail.live.com. Any idea how to parse a web page the way a browser does?
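For reference, here is a minimal sketch of the "just tells me where tags are placed" behaviour, assuming Python 3's standard `html.parser` module. The names `TagLocator` and `locate_tags` and the sample markup are made up for illustration; this is not the exact code from my attempt.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Record where each start tag sits in the source, without rewriting it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []  # entries: (tag name, line, column, raw tag text)

    def handle_starttag(self, tag, attrs):
        # getpos() gives a 1-based line number and a 0-based column offset.
        line, col = self.getpos()
        # get_starttag_text() returns the tag exactly as written in the source.
        self.tags.append((tag, line, col, self.get_starttag_text()))


def locate_tags(source):
    """Hypothetical helper: feed the markup and return the recorded tags."""
    parser = TagLocator()
    parser.feed(source)
    parser.close()
    return parser.tags


# Deliberately sloppy sample markup: <b> is never closed, src is unquoted.
html = "<p>Hello <b>world</p>\n<img src=a.png>"
tags = locate_tags(html)
for entry in tags:
    print(entry)
```

The parser only reports events and positions, so the input string is left as it was. It also makes no attempt to repair broken markup, which is probably part of why it struggles with messy real-world pages.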

Upvotes: 1

Views: 296

Answers (3)

Jiri

Reputation: 16625

Have you tried the WebKit engine with Python bindings?

See this: https://github.com/niwibe/phantompy

You can traverse the real DOM of the parsed web page and do what you need to do.

Upvotes: 0

Mehraban

Reputation: 3324

No. At this moment there is no such HTML parser, and every parser has its own limitations.

Upvotes: 0

user723556

Reputation:

You can use BeautifulSoup to extract just the text without modifying the tags. It's in their documentation.

Same question here: How to extract text from beautiful soup

Upvotes: 1
