user15902782

Reputation: 41

Cleaning scraped text in Python

I am new to Python and have just started learning web scraping with Beautiful Soup (in a Jupyter notebook). I scraped a book from Project Gutenberg and want to translate it. However, I had trouble first cleaning the text and then translating it.

I want to get rid of the stuff at the beginning of the scraped text (e.g. BODY { color: Black; background: White;....) and then translate the entire text using the Google API.

I would be grateful for help or advice on both. My code so far is below. The translation code did not work and returned the following error: "WriteError: [Errno 32] Broken pipe"

# Imports (googletrans is assumed here, based on the Translator() call below)
import requests
from bs4 import BeautifulSoup
from googletrans import Translator

# Store the URL and download the page
url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'
r = requests.get(url)
html = r.text
print(html)
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html5lib")
type(soup)
# Scrape the entire text using get_text() and print it
text = soup.get_text()
print(text)
# Translate the text: init the Google API translator
translator = Translator()
translation = translator.translate(text, dest="ar")
print(translation)

Upvotes: 1

Views: 815

Answers (1)

Bhavya Parikh

Reputation: 3400

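The book's text sits in p tags, so you can collect those tags with the find_all method from the bs4 module and read the text of each one. This also leaves out the CSS at the top (BODY { color: Black; ...), which comes from the page's style element and ends up in the output when you call get_text() on the whole document.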

from bs4 import BeautifulSoup
import requests

url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'
response = requests.get(url)
html = response.text
# print(html)
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html.parser")
# Collect every <p> element and print its text
paragraphs = soup.find_all("p")
for para in paragraphs:
    print(para.text)

Output:
"Christmas won't be Christmas without any presents," grumbled Jo, lying
on the rug.
...
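
If you want the cleaned text as data rather than printed output, the same idea can be wrapped in two small helpers. This is only a sketch: extract_paragraphs and chunk_paragraphs are hypothetical names, the inline sample_html stands in for the downloaded page, and the max_chars limit is an assumed value, not a documented limit of any translation service. Sending a whole book in one request is one possible cause of the broken pipe error, which is why the second helper splits the text into smaller pieces.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(html):
    """Return the text of each <p> element with line breaks collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop <style> and <script> so their contents never reach the output.
    for tag in soup(["style", "script"]):
        tag.decompose()
    paragraphs = []
    for p in soup.find_all("p"):
        # split() + join() collapses the hard line breaks inside a paragraph.
        text = " ".join(p.get_text().split())
        if text:
            paragraphs.append(text)
    return paragraphs


def chunk_paragraphs(paragraphs, max_chars=4000):
    """Group paragraphs into chunks of at most max_chars characters.

    max_chars is an assumed limit chosen for illustration. A single
    paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in paragraphs:
        candidate = para if not current else current + "\n\n" + para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Small stand-in for the downloaded page, so the sketch runs offline.
sample_html = """
<html><head><style>BODY { color: Black; background: White; }</style></head>
<body><p>First paragraph,
split over two lines.</p><p>Second paragraph.</p></body></html>
"""
paragraphs = extract_paragraphs(sample_html)
chunks = chunk_paragraphs(paragraphs, max_chars=40)
```

With the real page you would pass response.text instead of sample_html, then loop over chunks and call translator.translate(chunk, dest="ar") on each one, as in the question. Whether that avoids the broken pipe error is untested here.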

Upvotes: 2
