So, I wrote a minimal function to scrape all the text from a webpage:
import requests
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4

url = 'http://www.brainpickings.org'
request = requests.get(url)
soup_data = BeautifulSoup(request.content)
texts = soup_data.findAll(text=True)

def visible(element):
    # Skip text nodes whose parent tag is never rendered on the page
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    return True

print filter(visible, texts)
But it doesn't work that smoothly: some unwanted tags are still left in the output. Also, if I try to remove the characters I don't want with a regex, I get an error:
    elif re.match('<!--.*-->', str(element)):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 209: ordinal not in range(128)
How can I improve this so that it returns only the visible text of the page?
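
For reference, here is a minimal sketch of one direction this could take, not a definitive fix. It assumes BeautifulSoup 4 (the `bs4` package) and relies on the fact that in Python 2 `str()` on a unicode string containing a character such as u'\u2019' encodes it with the ascii codec, which is what raises the UnicodeEncodeError. BeautifulSoup exposes HTML comments as their own node type, `Comment`, so they can be filtered with `isinstance` instead of a regex on `str(element)`. The names `HIDDEN_PARENTS`, `visible_text` and `sample_html` are hypothetical, and the inline HTML stands in for the downloaded page so the snippet runs without a network connection.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tags whose text is not shown to the reader (same list as in the question).
HIDDEN_PARENTS = {'style', 'script', '[document]', 'head', 'title'}

def visible(element):
    """Return True for text nodes that should be kept."""
    if element.parent.name in HIDDEN_PARENTS:
        return False
    # Comments are a separate node type, so no regex or str() call is needed.
    if isinstance(element, Comment):
        return False
    return True

def visible_text(html):
    """Join the visible, non-blank text nodes of an HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    texts = soup.find_all(string=True)
    return u' '.join(t.strip() for t in texts if visible(t) and t.strip())

# Hypothetical sample page; it includes the u'\u2019' character from the traceback.
sample_html = u"""<html><head><title>Hidden title</title>
<style>p { color: red; }</style></head>
<body><!-- a comment --><p>It\u2019s visible.</p>
<script>var x = 1;</script></body></html>"""

result = visible_text(sample_html)
```

With the sample above, `result` holds only the paragraph text. The title, the style and script contents, and the comment are all dropped, and the text stays unicode throughout, so nothing is ever pushed through the ascii codec.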