Reputation: 75
I am working with this website: https://www.pealim.com/dict/?page=1. I want to get each Hebrew word and its pronunciation.
Below is my code. It loops through all the table rows; however, every iteration produces the exact same output: {'latin': 'av', 'hebrew': u'\u05d0\u05b8\u05d1'}
This code also only handles page=1. I would love to know if there is an automated way to loop through every page.
import requests
from lxml import etree

resp = requests.get("https://www.pealim.com/dict/?page=1")
htmlparser = etree.HTMLParser()
tree = etree.fromstring(resp.text, htmlparser)
for td in tree.xpath('//*//table[@class="table table-hover dict-table-t"]/tbody/tr'):
    print(td)
    data = {
        'hebrew': td.xpath('string(//span[@class="menukad"])'),
        'latin': td.xpath('string(//span[@class="dict-transcription"])'),
    }
    print(data)
I would like to collect this information for every single entry on that website. Please let me know what I can do to achieve this.
Upvotes: 2
Views: 119
Reputation: 195573
import requests
from bs4 import BeautifulSoup
from pprint import pprint

for i in range(1, 411):
    data = []
    resp = requests.get("https://www.pealim.com/dict/?page={}".format(i))
    soup = BeautifulSoup(resp.text, 'lxml')
    for m, t in zip(soup.select('.menukad'), soup.select('.dict-transcription')):
        data.append((m.text, t.text))
    print('PAGE {}'.format(i))
    print('*' * 80)
    pprint(data)
Prints:
PAGE 1
********************************************************************************
[('אָב', 'av'),
('אַבָּא', 'aba'),
('אָבִיב', 'aviv'),
('אֵב', 'ev'),
('לֶאֱבוֹד', "le'evod"),
('לְהֵיאָבֵד', "lehe'aved"),
('לְאַבֵּד', "le'abed"),
('לְהִתְאַבֵּד', "lehit'abed"),
('לְהַאֲבִיד', "leha'avid"),
('הִתְאַבְּדוּת', "hit'abdut"),
('אִיבּוּד', 'ibud'),
('אֲבֵדָה', 'aveda'),
('אָבוּד', 'avud'),
('לְאַבְחֵן', "le'avchen"),
('אִיבְחוּן', 'ivchun')]
PAGE 2
********************************************************************************
[('לְאַבְטֵחַ', "le'avteach"),
('אִיבְטוּחַ', 'ivtuach'),
('אֲבַטִּיחַ', 'avatiach'),
('לֶאֱבוֹת', "le'evot"),
('אֵבֶל', 'evel'),
('לֶאֱבוֹל', "le'evol"),
('אֲבָל', 'aval'),
('לְהִתְאַבֵּל', "lehit'abel"),
('לְהִתְאַבֵּן', "lehit'aben"),
('אֶבֶן', 'even'),
('לְהַאֲבִיס', "leha'avis"),
('לְהֵיאָבֵק', "lehe'avek"),
('מַאֲבָק', "ma'avak"),
('לְאַבֵּק', "le'abek"),
('אָבָק', 'avak')]
PAGE 3
********************************************************************************
[('לְהִתְאַבֵּק', "lehit'abek"),
('לְהִתְאַבֵּק', "lehit'abek"),
('מְאוּבָּק', "me'ubak"),
('אִיבּוּק', 'ibuk'),
...and so on.
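If you would rather not hard-code the page count (411 above), one option is to keep requesting pages until one comes back with no entries. This is only a sketch: it assumes that a page number past the end returns a page without any entries rather than an error, which I have not verified for this site. The names scrape_all_pages, fetch_entries, fake_fetch and FAKE_SITE are hypothetical, and the fake fetcher stands in for the real request so the example runs offline.

```python
def scrape_all_pages(fetch_entries, first_page=1, max_pages=1000):
    """Collect entries page by page until a page yields nothing.

    fetch_entries is any callable that takes a page number and returns
    a list of (hebrew, latin) tuples. max_pages is a safety cap so the
    loop cannot run forever if the site never returns an empty page.
    """
    all_entries = []
    for page_number in range(first_page, first_page + max_pages):
        entries = fetch_entries(page_number)
        if not entries:
            # Assumption: an empty page means we are past the last page.
            break
        all_entries.extend(entries)
    return all_entries


# Hypothetical offline stand-in for the real site: two pages of data.
FAKE_SITE = {
    1: [("אָב", "av"), ("אַבָּא", "aba")],
    2: [("אָבִיב", "aviv")],
}


def fake_fetch(page_number):
    # Pages that do not exist return an empty list.
    return FAKE_SITE.get(page_number, [])


print(scrape_all_pages(fake_fetch))
```

For real use, fetch_entries would wrap the requests.get and zip(...) lines from the loop above and return the list for a single page.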
Upvotes: 4
Reputation: 5774
Andrej beat me to it, but alternatively you can use the .find() and .get_text() methods of BeautifulSoup:
import bs4
import requests

for page_number in range(1, 411):
    print("-" * 35, page_number, "-" * 35)
    resp = requests.get("https://www.pealim.com/dict/?page={}".format(page_number))
    soup = bs4.BeautifulSoup(resp.text, "html.parser")
    table_elem = soup.find("tbody")
    rows = table_elem.find_all("tr")
    for row in rows:
        hebrew = row.find("span", class_="menukad").get_text()
        latin = row.find("span", class_="dict-transcription").get_text()
        print("{}: {}".format(hebrew, latin))
This yields essentially the same result.
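The same per-row scoping explains why the lxml code in the question printed 'av' every time. In XPath, an expression starting with // searches from the document root even when it is called on a row element, so string(//span[...]) always returns the first match on the whole page. Prefixing it with a dot (.//span[...]) makes the search relative to the current row. Below is a minimal sketch of that fix; SAMPLE_HTML and extract_rows are hypothetical names, and the sample markup only imitates the structure that the question's XPath assumes, so it runs without a network request.

```python
from lxml import etree

# Hypothetical sample markup imitating the table structure from the question.
SAMPLE_HTML = """
<html><body>
<table class="table table-hover dict-table-t">
  <tbody>
    <tr>
      <td><span class="menukad">אָב</span></td>
      <td><span class="dict-transcription">av</span></td>
    </tr>
    <tr>
      <td><span class="menukad">אַבָּא</span></td>
      <td><span class="dict-transcription">aba</span></td>
    </tr>
  </tbody>
</table>
</body></html>
"""


def extract_rows(html_text):
    """Return one {'hebrew': ..., 'latin': ...} dict per table row."""
    tree = etree.fromstring(html_text, etree.HTMLParser())
    rows = tree.xpath('//table[@class="table table-hover dict-table-t"]/tbody/tr')
    results = []
    for row in rows:
        results.append({
            # The leading dot keeps the search inside the current row.
            "hebrew": row.xpath('string(.//span[@class="menukad"])'),
            "latin": row.xpath('string(.//span[@class="dict-transcription"])'),
        })
    return results


for entry in extract_rows(SAMPLE_HTML):
    print(entry)
```

With the relative paths, each row yields its own word and transcription instead of repeating the first entry on the page.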
Upvotes: 1