Cleaning scraped text in Python

Question

I am new to Python and have just started learning web scraping with Beautiful Soup (in a Jupyter notebook). I scraped a book from Project Gutenberg and want to translate it. However, I am having trouble cleaning the text and then translating it.

I want to get rid of the stuff at the beginning of the scraped text (e.g. BODY { color: Black; background: White;....) and then translate the entire text using the Google API.

I would be grateful for help or advice on both. My code so far is below. The translation code did not work and returned the following error: "WriteError: [Errno 32] Broken pipe"

# Imports (googletrans is assumed here, based on the Translator usage below)
import requests
from bs4 import BeautifulSoup
from googletrans import Translator

# Store the URL and fetch the page
url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'
r = requests.get(url)
html = r.text
print(html)

# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html5lib")
type(soup)

# Extract the entire text with get_text() and print it
text = soup.get_text()
print(text)

# Initialise the Google API translator and translate the text
translator = Translator()
translation = translator.translate(text, dest="ar")
print(translation)
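
The "Broken pipe" error may come from sending the whole book to the translator in a single request. The following is a minimal sketch, not a confirmed fix: it splits the text into smaller pieces and translates them one at a time. The helper names split_into_chunks and translate_in_chunks, the translate_fn parameter, and the 4000-character limit are all assumptions made for illustration.

```python
def split_into_chunks(text, max_chars=4000):
    """Split text into pieces of at most max_chars, breaking at line ends where possible."""
    chunks, current = [], ""
    for paragraph in text.split("\n"):
        # A single line longer than the limit is cut into fixed-size slices.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if len(candidate) > max_chars:
            # Adding this line would exceed the limit, so start a new chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_in_chunks(text, translate_fn, max_chars=4000):
    """Apply translate_fn (any callable str -> str) to each chunk and rejoin the results."""
    return "\n".join(translate_fn(chunk) for chunk in split_into_chunks(text, max_chars))


# Demo with a stand-in "translator" (str.upper) so the sketch runs offline.
print(translate_in_chunks("first paragraph\nsecond paragraph", str.upper, max_chars=20))
```

With the translator object from the code above, translate_fn could be something like `lambda chunk: translator.translate(chunk, dest="ar").text`, assuming the installed translator version returns a result object with a .text attribute.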


Answers (1)
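
One way to drop the CSS at the start of the text is to remove the <style> (and <script>) elements from the soup before calling get_text(), since get_text() otherwise includes their contents. This is a minimal sketch under stated assumptions: the function name extract_visible_text and the sample string are hypothetical, and "html.parser" is used only so the example needs no extra parser package; "html5lib" from the question should behave the same way here.

```python
from bs4 import BeautifulSoup


def extract_visible_text(html, parser="html.parser"):
    """Return the page text without the contents of <style> and <script> elements."""
    soup = BeautifulSoup(html, parser)
    # Remove the elements (and everything inside them) from the parse tree.
    for tag in soup(["style", "script"]):
        tag.decompose()
    # Strip each line and drop the empty ones left behind by the markup.
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)


# Small made-up page that mimics the CSS seen at the top of the scraped text.
sample = """<html><head>
<style>BODY { color: Black; background: White; }</style></head>
<body><h1>Sample heading</h1><p>First paragraph.</p></body></html>"""

cleaned = extract_visible_text(sample)
print(cleaned)
```

For the book in the question, the call would be extract_visible_text(html), where html is the r.text value fetched earlier. The Project Gutenberg header and licence text are ordinary page text, so this step alone does not remove them.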
