Ken Lin

Reputation: 1035

Using BeautifulSoup4 with Google Translate

I am currently going through the Web Scraping section of AutomateTheBoringStuff and trying to write a script that extracts translated words from Google Translate using BeautifulSoup4.

I inspected the HTML of a page where 'Explanation' is the translated word:

<span id="result_box" class="short_text" lang="en">  
    <span class>Explanation</span>
</span>

Using BeautifulSoup4, I tried different selectors but nothing would return the translated word. Here are a few examples I tried, but they return no results at all:

soup.select('span[id="result_box"] > span')  
soup.select('span span') 

I even copied the selector directly from the Developer Tools, which gave me #result_box > span. This again returns no results.

Can someone explain to me how to use BeautifulSoup4 for my purpose? This is my first time using BeautifulSoup4 but I think I am using BeautifulSoup more or less correctly because the selector

soup.select('span[id="result_box"]')

gets me the outer span element**

[<span class="short_text" id="result_box"></span>]

**Not sure why the 'lang="en"' part is missing, but I am fairly certain I have located the correct element regardless.

Here is the complete code:

import bs4, requests

url = 'https://translate.google.ca/#zh-CN/en/%E6%B2%BB%E5%85%B7'
res = requests.get(url)
res.raise_for_status()  # must be called; without the parentheses this line does nothing
soup = bs4.BeautifulSoup(res.text, "html.parser")
translation = soup.select('#result_box span')
print(translation)

EDIT: If I save the Google Translate page as an offline HTML file and build the soup object from that file, the element is located without any problem.

import bs4

file = open("Google Translate.html")
soup = bs4.BeautifulSoup(file, "html.parser")
translation = soup.select('#result_box span')
print(translation)

Upvotes: 4

Views: 3490

Answers (3)

Just Me

Reputation: 1053

You can try this different approach:

if filename.endswith(extension_file):
    with open(os.path.join(files_from_folder, filename), encoding='utf-8') as html:
        soup = BeautifulSoup('<pre>' + html.read() + '</pre>', 'html.parser')
        for title in soup.findAll('title'):
            recursively_translate(title)

For the complete code, see here:

https://neculaifantanaru.com/en/python-code-text-google-translate-website-translation-beautifulsoup-library.html

or here:

https://neculaifantanaru.com/en/example-google-translate-api-key-python-code-beautifulsoup.html

Upvotes: 0

akash karothiya

Reputation: 5950

Simply try this:

translation = soup.select('#result_box span')[0].text
print(translation)
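
One caveat: indexing with [0] raises an IndexError whenever select() returns an empty list, which is the case for the markup the question reports getting back from requests. A minimal sketch of a safer lookup, assuming simplified copies of the two snippets shown in the question (the helper name first_text is made up for illustration):

```python
import bs4

# Simplified copies of the two snippets from the question:
# what the browser shows after rendering, and what requests received.
RENDERED = '<span id="result_box" class="short_text" lang="en"><span>Explanation</span></span>'
FETCHED = '<span class="short_text" id="result_box"></span>'


def first_text(html, selector="#result_box span"):
    """Return the text of the first match, or None when nothing matches."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)  # returns None instead of raising
    return node.get_text(strip=True) if node is not None else None


print(first_text(RENDERED))
print(first_text(FETCHED))
```

With the rendered markup this finds the inner span; with the fetched markup it returns None, so the selector is fine and the content is simply not there.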

Upvotes: 0

Padraic Cunningham

Reputation: 180391

The result_box span is the correct element, but your code only works on the saved file because saving from the browser includes the dynamically generated content. With requests you get only the page source itself, without any dynamically generated content. The translation is generated by an AJAX call to the URL below:

"https://translate.google.ca/translate_a/single?client=t&sl=zh-CN&tl=en&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1&tk=902911.786207&q=%E6%B2%BB%E5%85%B7"

For your request, it returns:

[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]
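
Note that this body is not strict JSON: empty slots are written as bare commas, so json.loads rejects it as-is. A rough sketch of one way to read it, assuming the layout above and assuming no string in the payload contains ",," or "[," (the helper name parse_sparse_json is made up for illustration):

```python
import json
import re

# Response body copied from above; note the empty slots (",,").
RAW = (
    '[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,'
    '[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],'
    '["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],'
    '[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]'
)

# An empty slot sits between "[" or "," and the next ",",
# or between "," and a closing "]".
_EMPTY_SLOT = re.compile(r'(?<=[\[,])(?=,)|(?<=,)(?=\])')


def parse_sparse_json(text):
    """Fill empty array slots with null, then parse as ordinary JSON."""
    return json.loads(_EMPTY_SLOT.sub("null", text))


data = parse_sparse_json(RAW)
translation = data[0][0][0]                            # first translated segment
alternatives = [entry[0] for entry in data[5][0][2]]   # other candidates

print(translation)
print(alternatives)
```

The index positions are read off this one sample; nothing here guarantees the layout stays the same for other queries.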

So you will either have to mimic the request, passing all the necessary parameters, or use something that supports dynamic content, such as Selenium.
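
To see which parameters "mimicking the request" involves, the query string of the AJAX URL can be split apart with the standard library. This is only a sketch: it makes no network call, and whether the tk value can be reused for a different query, or whether the endpoint still behaves this way, is an assumption that is not verified here.

```python
from urllib.parse import parse_qs, urlsplit

# The AJAX URL quoted in this answer.
AJAX_URL = (
    "https://translate.google.ca/translate_a/single?client=t&sl=zh-CN&tl=en&hl=en"
    "&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t"
    "&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1"
    "&tk=902911.786207&q=%E6%B2%BB%E5%85%B7"
)

parts = urlsplit(AJAX_URL)
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
params = parse_qs(parts.query)  # every value is a list; "dt" repeats

print(endpoint)
print(params["sl"], params["tl"], params["q"])
print(params["dt"])

# requests accepts this dict directly and repeats keys for list values:
#     res = requests.get(endpoint, params=params)
```

Here sl and tl carry the source and target languages and q carries the percent-decoded text to translate.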

Upvotes: 3
