Reputation: 53
I am trying to scrape the original sentence and translations e.g. from the following page:
https://tatoeba.org/eng/sentences/search?query=Verwirrung&from=und&to=spa
My current code looks like this:
import requests
from bs4 import BeautifulSoup

word = "Verwirrung"
url = "https://tatoeba.org/eng/sentences/search?query={}&from=und&to=spa".format(word)
vstr = requests.get(url).content
Soup = BeautifulSoup(vstr, features="html.parser")
rows = Soup.findAll('div', {"class": "sentence-and-translations md-whiteframe-1dp"})
for row in rows:
    if "Verw" in str(row):
        print(row)
However, this prints nothing. The idea is to iterate over every entry, but each entry is wrapped in a div tag with loads of "quot" signs (&quot; entities) in it, and I'm not sure how to find or filter for them. The first entry is nested within its div tag as follows:
<div ng-cloak
sentence-and-translations
ng-init="vm.init([], {"id":922518,"text":"Die Verwirrung spottet jeder Beschreibung.","lang":"deu","langName":"German","script":null,"dir":"ltr","audios":[],"correctness":0,"isFavorite":false,"isOwnedByCurrentUser":false,"user":{"id":6,"username":"MUIRIEL","role":"advanced_contributor","level":0},"highlightedText":"Die <span class=\"match\">Verwirrung<\/span> spottet jeder Beschreibung."}, [], [{"id":76623,"text":"\u305d\u306e\u6df7\u4e71\u5b9f\u306b\u540d\u72b6\u3059\u3079\u304b\u3089\u305a\u3002","lang":"jpn","langName":"Japanese","script":null,"dir":"ltr","audios":null,"correctness":0,"isFavorite":false,"isOwnedByCurrentUser":false,"user":null}])"
class="sentence-and-translations md-whiteframe-1dp">
Is there a good clean way to only iterate over the data?
Upvotes: 0
Views: 498
Reputation: 142631
You can read row['ng-init'] to get the whole JavaScript call. If you remove vm.init([], at the beginning (11 chars) and ) at the end (1 char), and then wrap the rest in [ and ], you get JSON data which you can convert to a Python structure of lists and dictionaries:
data = '[' + row['ng-init'][11:-1] + ']'
data = json.loads(data)
and now you can use
print(data[0].keys())
print('DE:', data[0]['text'])
for item in data[1]:
    print('ES:', item['text'])
for item in data[2]:
    print('ES:', item['text'])
You can access the other fields of each item using these keys:
'id', 'text', 'lang', 'langName', 'script', 'dir', 'audios', 'correctness',
'isFavorite', 'isOwnedByCurrentUser', 'user', 'highlightedText'
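To see why the slicing works without requesting the page, here is a minimal sketch on a shortened stand-in for the attribute value. The sample string is trimmed for illustration (the real one has more keys), and the names translations and more_translations are only labels chosen here for the second and third arguments.

```python
import json

# Shortened stand-in for row['ng-init']; the real attribute has more keys.
ng_init = (
    'vm.init([], {"id": 1, "text": "Verwirre ich dich?", "lang": "deu"}, '
    '[{"id": 2, "text": "\\u00bfTe estoy confundiendo?", "lang": "spa"}], [])'
)

prefix = "vm.init([],"
assert len(prefix) == 11  # the 11 characters dropped by [11:-1]

# Drop the prefix and the trailing ")", then wrap the three remaining
# arguments in "[" and "]" so they form one JSON list.
data = json.loads("[" + ng_init[11:-1] + "]")

sentence, translations, more_translations = data
print(sentence["text"])
print([t["text"] for t in translations])
print(more_translations)
```

The fixed offsets only hold while the attribute starts with exactly vm.init([], and ends with a single ), so it is worth checking that before slicing.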
Full working code
from bs4 import BeautifulSoup
import requests
import json

word = "Verwirrung"
url = "https://tatoeba.org/eng/sentences/search?query={}&from=und&to=spa".format(word)
vstr = requests.get(url).content
Soup = BeautifulSoup(vstr, features="html.parser")
rows = Soup.findAll('div', {"class": "sentence-and-translations md-whiteframe-1dp"})
for row in rows:
    data = '[' + row['ng-init'][11:-1] + ']'
    #print(data)
    data = json.loads(data)
    #print(data[0].keys())
    print('DE:', data[0]['text'])  #, data[0]['lang'], data[0]['langName'])
    for item in data[1]:
        print('ES:', item['text'])  #, item['lang'], item['langName'])
    for item in data[2]:
        print('ES:', item['text'])  #, item['lang'], item['langName'])
Result
DE: Sie verwirren mich.
ES: Me estás confundiendo.
DE: Verwirre ich dich?
ES: ¿Te estoy confundiendo?
ES: ¿Te estoy liando?
DE: Verwirre ich Sie?
ES: ¿Te estoy confundiendo?
DE: Verwirre ich euch?
ES: ¿Te estoy confundiendo?
DE: Die Berichte waren verwirrend.
ES: Los informes eran confusos.
DE: Es war sehr verwirrend.
ES: Fue muy confuso.
ES: Era muy confuso.
DE: Es kann zunächst verwirrend sein.
ES: Al principio puede ser confuso.
DE: Es ist mitunter zunächst verwirrend.
ES: Al principio puede ser confuso.
DE: Das kann anfangs Verwirrung stiften.
ES: Al principio puede ser confuso.
DE: Oh, jetzt ist es wirklich verwirrend...
ES: Oh, ahora es realmente extraño...
ES: Oh, ahora es realmente confuso...
ES: Ay, ahora sí que está raro...
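The loops above label every translation as ES, but the first example in the question has a Japanese entry in its last list, so it may be safer to label each line from the item's own lang key. A minimal sketch, assuming data has the same three-part shape as in the code above; format_entry is a hypothetical helper, and the sample is cut down from the snippet in the question.

```python
from itertools import chain

def format_entry(data, only_lang=None):
    """Return printable lines for one parsed ng-init entry.

    data is [sentence_dict, list_of_translations, list_of_translations];
    only_lang optionally keeps just the translations with that 'lang' code.
    """
    sentence = data[0]
    lines = ["{}: {}".format(sentence["lang"].upper(), sentence["text"])]
    for item in chain(data[1], data[2]):
        if only_lang is None or item["lang"] == only_lang:
            lines.append("{}: {}".format(item["lang"].upper(), item["text"]))
    return lines

# Cut-down version of the first entry shown in the question.
sample = [
    {"text": "Die Verwirrung spottet jeder Beschreibung.", "lang": "deu"},
    [],
    [{"text": "\u305d\u306e\u6df7\u4e71\u5b9f\u306b\u540d\u72b6\u3059\u3079\u304b\u3089\u305a\u3002",
      "lang": "jpn"}],
]
print(format_entry(sample))                   # sentence plus the JPN line
print(format_entry(sample, only_lang="spa"))  # sentence only, no Spanish here
```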
Upvotes: 1
Reputation: 2097
The data you want is here:
<div ng-cloak
sentence-and-translations
ng-init="vm.init([], {"id":5127750,"text":"Verwirre ich dich?","lang":"deu","langName":"German","script":null,"dir":"ltr","audios":[{"user_id":75624,"external":null,"sentence_id":5127750,"user":{"username":"Oblomov","audio_license":null,"audio_attribution_url":null}}],"correctness":0,"isFavorite":false,"isOwnedByCurrentUser":false,"user":{"id":54488,"username":"raggione","role":"advanced_contributor","level":0},"highlightedText":"<span class=\"match\">Verwirre<\/span> ich dich?"}, [{"id":5127755,"text":"\u00bfTe estoy confundiendo?","lang":"spa","langName":"Spanish","script":null,"dir":"ltr","audios":[],"correctness":0,"isFavorite":false,"isOwnedByCurrentUser":false,"user":null},{"id":5127757,"text":"\u00bfTe estoy liando?","lang":"spa","langName":"Spanish","script":null,"dir":"ltr","audios":[],"correctness":0,"isFavorite":false,"isOwnedByCurrentUser":false,"user":null}], [])"
class="sentence-and-translations md-whiteframe-1dp">
So concentrate your efforts here:
import requests
from bs4 import BeautifulSoup

word = "Verwirrung"
url = "https://tatoeba.org/eng/sentences/search?query={}&from=und&to=spa".format(word)
vstr = requests.get(url).content
Soup = BeautifulSoup(vstr, features="html.parser")
div = Soup.findAll('div', {"class": "sentence-and-translations md-whiteframe-1dp"})
ng_init = div[0]["ng-init"]
# Strip the JavaScript call wrapper; note that this also removes the final "]"
ng_init = ng_init.replace("vm.init([], ", "").replace("])", "")
This gets you the nicely formatted sentence, with its translations included for free:
{"id":5127745,"text":"Sie verwirren mich.","lang":"deu","langName":"German","script":null,"dir":"ltr","audios":[],"correctness":0,"isFavorite":false,"isOwnedByCurrentUser":false,"user":{"id":54488,"username":"raggione","role":"advanced_contributor","level":0},"highlightedText":"Sie <span class=\\"match\\">verwirren<\\/span> mich."}, [], [{"id":2931574,"text":"Me est\\u00e1s confundiendo.","lang":"spa","langName":"Spanish","script":null,"dir":"ltr","audios":null,"correctness":0,"isFavorite":false,"isOwnedByCurrentUser":false,"user":null}
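Note that the text above is not a complete JSON document on its own, because the last replace call also strips the closing bracket. One way to still read the sentence object is json.JSONDecoder.raw_decode, which parses the first JSON value at the start of a string and reports where it stopped. A minimal sketch on a shortened string of the same shape (most keys are omitted here for brevity):

```python
import json

# Shortened stand-in for the stripped ng_init text: an object, then two
# lists, with the final "]" missing as in the output above.
ng_init = (
    '{"id": 5127745, "text": "Sie verwirren mich.", "lang": "deu"}, [], '
    '[{"id": 2931574, "text": "Me est\\u00e1s confundiendo.", "lang": "spa"}'
)

decoder = json.JSONDecoder()
# raw_decode returns the parsed value and the index just past it,
# ignoring whatever follows in the string.
sentence, end = decoder.raw_decode(ng_init)
print(sentence["text"])

# Alternatively, restore the stripped "]" and wrap everything in a list
# so the whole string becomes valid JSON again.
data = json.loads("[" + ng_init + "]]")
print(data[2][0]["text"])
```

The second variant depends on exactly one "]" having been stripped, so slicing off only the vm.init( wrapper, as in the other answer, is probably the less fragile route.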
Upvotes: 1
Reputation: 5730
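Use Selenium for this:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import os

# Selenium 4 style: the driver path is passed through Service, and the old
# find_elements_by_* helpers are replaced by find_elements(By..., ...)
service = Service(executable_path=os.path.join(os.getcwd(), "chromedriver"))
browser = webdriver.Chrome(service=service)
link = 'https://tatoeba.org/eng/sentences/search?query=Verwirrung&from=und&to=spa'
browser.get(link)
raw_data = browser.find_elements(By.CSS_SELECTOR, '.text.ng-binding.flex')
for item in raw_data:
    print(item.text)
browser.quit()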
Output:
Sie verwirren mich.
Me estás confundiendo.
Verwirre ich dich?
¿Te estoy confundiendo?
¿Te estoy liando?
Verwirre ich Sie?
¿Te estoy confundiendo?
Verwirre ich euch?
¿Te estoy confundiendo?
Die Berichte waren verwirrend.
Los informes eran confusos.
Es war sehr verwirrend.
Fue muy confuso.
Era muy confuso.
Es kann zunächst verwirrend sein.
Al principio puede ser confuso.
Es ist mitunter zunächst verwirrend.
Al principio puede ser confuso.
Das kann anfangs Verwirrung stiften.
Al principio puede ser confuso.
Oh, jetzt ist es wirklich verwirrend...
Oh, ahora es realmente extraño...
Oh, ahora es realmente confuso...
Ay, ahora sí que está raro...
Upvotes: 1