Reputation: 2759
There are many modules like lxml, Beautiful Soup, nltk and pyenchant for extracting proper English words from a page. But what is the cleanest, shortest way to do it, along the lines of what html2text offers, ideally stripping Markdown as well? Is there a universal regex that could take away all the HTML tags?
import re
import nltk

def word_parse(f):
    # f = url.content here, from the "requests" module
    raw = nltk.clean_html(f)          # strip HTML tags from the raw markup
    match = re.compile(r'[a-zA-Z]+')  # keep only purely alphabetic tokens
    ls = []
    for mat in raw.split():
        try:
            v = match.match(mat).group()
            ls.append(v)
        except AttributeError:
            # match() returned None: the token has no leading letters
            pass
    return ls
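The kind of universal regex I have in mind is the usual naive tag-stripper sketched below, though I assume it mishandles script blocks, HTML comments, and attribute values that contain ">":

import re

def strip_tags(html):
    # Naive approach: delete anything between "<" and ">".
    # Rough sketch only; it breaks on <script> contents, comments,
    # and attributes whose values contain ">".
    return re.sub(r'<[^>]+>', ' ', html)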
Can someone suggest a cleaner, more optimized code snippet for this?
Upvotes: 0
Views: 287
Reputation: 37344
I strongly recommend going with an existing library rather than trying to write your own regexps for this. Other people have put considerable work into Beautiful Soup, for instance, and you might as well benefit from it.
For this specific case, Beautiful Soup offers the get_text method:
text = BeautifulSoup(f).get_text()
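A slightly fuller sketch, assuming you fetch the page with requests (the URL here is just a placeholder):

import re
import requests
from bs4 import BeautifulSoup

url = "http://example.com"  # placeholder URL for illustration
response = requests.get(url)

# get_text() walks the parse tree and concatenates the text nodes,
# so no regex over the raw HTML is needed.
text = BeautifulSoup(response.content, "html.parser").get_text()

# If you still want only alphabetic words afterwards, findall is
# cleaner than matching token by token:
words = re.findall(r"[a-zA-Z]+", text)

This also copes with malformed markup far better than any hand-rolled pattern will.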
Upvotes: 2