Reputation: 41
I am new to Python and have just started learning web scraping with Beautiful Soup (in a Jupyter notebook). I scraped a book from Project Gutenberg and want to translate it. However, I had trouble cleaning the text and then translating it.
I want to get rid of the stuff at the beginning of the scraped text (e.g. BODY { color: Black; background: White;....) and then translate the entire text using the Google API.
I would be grateful for help or advice on both. My code so far is below. The translation code did not work and returned the following error: "WriteError: [Errno 32] Broken pipe"
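#Imports needed for fetching and parsing the page
import requests
from bs4 import BeautifulSoup
#Store url
url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'
#Fetch the page and keep its HTML
r = requests.get(url)
html = r.text
print(html)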
#Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html5lib")
type(soup)
#Scrape entire text using 'get_text' and print it
text = soup.get_text()
print(text)
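#Translate text using the Google API translator
#(assumes the googletrans package, whose Translator class matches the calls below)
from googletrans import Translator
#Init the Google API translator
translator = Translator()
translation = translator.translate(text,dest="ar")
print(translation)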
Upvotes: 1
Views: 815
Reputation: 3400
Since you want to scrape the text of the book, note that the text itself is written in p tags. You can collect those with the find_all method from the bs4 module and read the text from each one, which leaves out the style rules at the top of the page.
from bs4 import BeautifulSoup
import requests
url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'
response=requests.get(url)
html = response.text
# print(html)
#Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html.parser")
paragraph=soup.find_all("p")
for para in paragraph:
    print(para.text)
Output:
"Christmas won't be Christmas without any presents," grumbled Jo, lying
on the rug.
...
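For the translation step, one possible cause of the "Broken pipe" error is that the whole book is sent in a single request. That is an assumption, not something confirmed here. A minimal sketch of a workaround is to group the paragraph strings (for example each para.text from the loop above) into smaller chunks and translate them one at a time. The helper name chunk_paragraphs and the 4000-character default are hypothetical choices for illustration, not a documented limit of any translation service.

```python
def chunk_paragraphs(paragraphs, max_chars=4000):
    """Group paragraph strings into chunks of at most max_chars characters.

    max_chars is an assumed limit chosen for illustration only.
    """
    chunks = []
    current = ""
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # skip empty paragraphs
        # A paragraph longer than the limit is cut into fixed-size slices.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        if not para:
            continue  # the paragraph was consumed exactly by the slices
        # Join paragraphs with a blank line while the chunk stays within the limit.
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = ["First paragraph.", "Second paragraph.", "Third paragraph."]
    for chunk in chunk_paragraphs(sample, max_chars=40):
        print(repr(chunk))
```

Each chunk could then be passed to the same translator.translate(chunk, dest="ar") call used in the question, and the results joined afterwards. Whether this removes the error depends on the translation library and service, so treat it as something to try rather than a guaranteed fix.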
Upvotes: 2