9000

Reputation: 40894

Extract text from HTML faster than NLTK?

We use NLTK to extract text from HTML pages, but we only need the most trivial text analysis, e.g. a word count.

Is there a faster way to extract visible text from HTML using Python?

Some minimal understanding of HTML (and ideally CSS), such as visible vs. invisible nodes, images' alt texts, etc., would be a great bonus.

Upvotes: 1

Views: 1125

Answers (1)

alexisdevarennes

Reputation: 5642

I ran into the same problem at my previous workplace. You'll want to check out BeautifulSoup.

from bs4 import BeautifulSoup

# `html` is the page source as a string
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
print(soup.get_text())

You'll find its documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

You can ignore elements based on their attributes. As for understanding external stylesheets, I'm not too sure. One option that may not be too slow (depending on the page) is to render the page with something like PhantomJS and then select the rendered text :)
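If a headless browser is overkill for a word count, the "ignore elements based on attributes" idea can be roughly sketched with nothing but the standard library's html.parser. To be clear, this swaps in the stdlib parser instead of BeautifulSoup, it is not benchmarked against NLTK, and it only looks at the `hidden` attribute and inline `display:none` styles, so external stylesheets are not handled. All names here (`VisibleTextParser`, `visible_text`, `is_hidden`, `SKIPPED_TAGS`, `VOID_TAGS`, `SAMPLE_HTML`) are made up for illustration.

```python
from html.parser import HTMLParser

# Elements whose content is not rendered as page text.
SKIPPED_TAGS = {"script", "style", "head", "template", "noscript"}
# Void elements have no end tag, so they must not affect nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_hidden(attrs):
    """Heuristic only: `hidden` attribute or an inline display:none style."""
    attrs = dict(attrs)
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return "hidden" in attrs or "display:none" in style


class VisibleTextParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []       # collected visible text fragments
        self.stack = []        # (tag, hides_content) for each open element
        self.hidden_depth = 0  # number of open elements hiding their content

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Count an image's alt text as visible text.
            if tag == "img" and not self.hidden_depth and not is_hidden(attrs):
                alt = dict(attrs).get("alt")
                if alt:
                    self.chunks.append(alt)
            return
        hides = tag in SKIPPED_TAGS or is_hidden(attrs)
        self.stack.append((tag, hides))
        if hides:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        # Close the nearest matching open element, plus any unclosed
        # elements nested inside it; stray end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                for _, hides in self.stack[i:]:
                    if hides:
                        self.hidden_depth -= 1
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.hidden_depth:
            self.chunks.append(data)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Fragments are joined with spaces, which is fine for word counts but
    # will split a word that is broken up by inline tags.
    return " ".join(" ".join(parser.chunks).split())


SAMPLE_HTML = (
    "<html><head><title>T</title><style>p{color:red}</style></head><body>"
    "<p>Hello <b>world</b></p><script>var x = 1;</script>"
    '<div style="display: none">secret</div>'
    '<img src="a.png" alt="A cat"><p hidden>nope</p><p>Bye</p>'
    "</body></html>"
)

text = visible_text(SAMPLE_HTML)
print(text)               # Hello world A cat Bye
print(len(text.split()))  # 5
```

Whether this is actually faster than your current setup depends on your pages, so measure it on real input before switching.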

Upvotes: 2
