Reputation: 40894
We use NLTK to extract text from HTML pages, but we want only most trivial text analysis, e.g. word count.
Is there a faster way to extract visible text from HTML using Python?
Understanding HTML (and ideally CSS) at some minimal level, like visible / invisible nodes, images' alt texts, etc, would be additionally great.
Upvotes: 1
Views: 1125
Reputation: 5642
Ran into the same problem at my previous workplace. You'll want to check out beautifulsoup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
print soup.text
You'll find its documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can ignore elements based on attributes. As to understanding external stylesheets im not too sure. However what you could do there and something that would not be too slow (depending on the page) is to look into rendering the page with something like phantomjs and then selecting the rendered text :)
Upvotes: 2