user2290820

Reputation: 2759

Python: Better way to search and collect text strings from html. Strip off markdowns, tags, etc

There are many modules like lxml, Beautiful Soup, nltk and pyenchant that can help filter out proper English words. But what is the cleanest, shortest way to do this, like html2text offers, ideally stripping Markdown as well? (While writing this, I see scores of similar questions suggested on the right.) Is there a universal regex that could strip away all the HTML tags?

import re
import nltk

def word_parse(f):
    raw = nltk.clean_html(f)  # f = url.content here, from the "requests" module
    match = re.compile(r'[a-zA-Z]+')
    ls = []
    for word in raw.split():
        m = match.match(word)
        if m:  # skip tokens with no leading alphabetic characters
            ls.append(m.group())
    return ls

Can somebody suggest a cleaner, more optimized code snippet for this?

Upvotes: 0

Views: 287

Answers (1)

Peter DeGlopper

Reputation: 37344

I strongly recommend going with an existing library rather than trying to write your own regexps for this. Other people have put considerable work into Beautiful Soup, just for instance, and you might as well benefit from it.

For this specific case, Beautiful Soup offers the get_text method:

text = BeautifulSoup(f).get_text()
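Combining that with the regex from the question gives the word list you were after. A minimal sketch, assuming bs4 is installed and using a made-up HTML snippet in place of your fetched page:

```python
import re
from bs4 import BeautifulSoup  # assumes bs4 is installed

# Sample HTML standing in for url.content from "requests"
html = "<html><body><p>Hello, <b>world</b>! 42 times.</p></body></html>"

# get_text() strips all tags in one call; then keep purely alphabetic tokens
text = BeautifulSoup(html, "html.parser").get_text()
words = re.findall(r"[a-zA-Z]+", text)
# words == ['Hello', 'world', 'times']
```

Note that `re.findall` replaces the manual split/match/append loop in one step, and passing an explicit parser name avoids bs4's "no parser specified" warning.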

Upvotes: 2
