Gansaikhan Shur

Reputation: 75

How to Get Text Between Span Tags with XPath in Python

I am working with this website: https://www.pealim.com/dict/?page=1. I want to get the Hebrew word and its pronunciation for each entry.

Below is my code. It loops through the table rows; however, it produces the exact same output for every row, namely {'latin': 'av', 'hebrew': u'\u05d0\u05b8\u05d1'}. This code also only covers page=1, so I would love to know if there is an automated way to loop through every page.

import requests
from lxml import etree

resp = requests.get("https://www.pealim.com/dict/?page=1")

htmlparser = etree.HTMLParser()
tree = etree.fromstring(resp.text, htmlparser)

for td in tree.xpath('//*//table[@class="table table-hover dict-table-t"]/tbody/tr'):
    print(td)
    data = {
        'hebrew': td.xpath('string(//span[@class="menukad"])'),
        'latin': td.xpath('string(//span[@class="dict-transcription"])'),
    }
    print(data)

I would like to collect information for every single entry on that website. Please let me know what I can do to achieve this.
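For reference, the repeated output most likely comes from the //span paths inside the loop being absolute: lxml evaluates them from the document root on every iteration, so each row returns the first match on the page. Prefixing them with a dot makes them relative to the current row; a minimal sketch of the same loop with that change:

import requests
from lxml import etree

resp = requests.get("https://www.pealim.com/dict/?page=1")
tree = etree.fromstring(resp.text, etree.HTMLParser())

for tr in tree.xpath('//table[@class="table table-hover dict-table-t"]/tbody/tr'):
    data = {
        # the leading dot restricts the search to the current <tr>
        'hebrew': tr.xpath('string(.//span[@class="menukad"])'),
        'latin': tr.xpath('string(.//span[@class="dict-transcription"])'),
    }
    print(data)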

Upvotes: 2

Views: 119

Answers (2)

Andrej Kesely

Reputation: 195573

import requests
from bs4 import BeautifulSoup
from pprint import pprint

for i in range(1, 411):  # pages 1-410 of the dictionary
    data = []
    resp = requests.get("https://www.pealim.com/dict/?page={}".format(i))
    soup = BeautifulSoup(resp.text, 'lxml')
    # pair each Hebrew word with its transcription by document order
    for m, t in zip(soup.select('.menukad'), soup.select('.dict-transcription')):
        data.append((m.text, t.text))

    print('PAGE {}'.format(i))
    print('*' * 80)
    pprint(data)

Prints:

PAGE 1
********************************************************************************
[('אָב', 'av'),
 ('אַבָּא', 'aba'),
 ('אָבִיב', 'aviv'),
 ('אֵב', 'ev'),
 ('לֶאֱבוֹד', "le'evod"),
 ('לְהֵיאָבֵד', "lehe'aved"),
 ('לְאַבֵּד', "le'abed"),
 ('לְהִתְאַבֵּד', "lehit'abed"),
 ('לְהַאֲבִיד', "leha'avid"),
 ('הִתְאַבְּדוּת', "hit'abdut"),
 ('אִיבּוּד', 'ibud'),
 ('אֲבֵדָה', 'aveda'),
 ('אָבוּד', 'avud'),
 ('לְאַבְחֵן', "le'avchen"),
 ('אִיבְחוּן', 'ivchun')]
PAGE 2
********************************************************************************
[('לְאַבְטֵחַ', "le'avteach"),
 ('אִיבְטוּחַ', 'ivtuach'),
 ('אֲבַטִּיחַ', 'avatiach'),
 ('לֶאֱבוֹת', "le'evot"),
 ('אֵבֶל', 'evel'),
 ('לֶאֱבוֹל', "le'evol"),
 ('אֲבָל', 'aval'),
 ('לְהִתְאַבֵּל', "lehit'abel"),
 ('לְהִתְאַבֵּן', "lehit'aben"),
 ('אֶבֶן', 'even'),
 ('לְהַאֲבִיס', "leha'avis"),
 ('לְהֵיאָבֵק', "lehe'avek"),
 ('מַאֲבָק', "ma'avak"),
 ('לְאַבֵּק', "le'abek"),
 ('אָבָק', 'avak')]
PAGE 3
********************************************************************************
[('לְהִתְאַבֵּק', "lehit'abek"),
 ('לְהִתְאַבֵּק', "lehit'abek"),
 ('מְאוּבָּק', "me'ubak"),
 ('אִיבּוּק', 'ibuk'),

...and so on.
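The range(1, 411) above hardcodes the current page count. If you prefer not to hardcode it, here is a minimal sketch that keeps fetching pages until one comes back empty; it assumes (not verified here) that a page past the end simply contains no .menukad elements:

import requests
from bs4 import BeautifulSoup

page = 1
all_entries = []
while True:
    resp = requests.get("https://www.pealim.com/dict/?page={}".format(page))
    soup = BeautifulSoup(resp.text, 'lxml')
    entries = [(m.text, t.text)
               for m, t in zip(soup.select('.menukad'), soup.select('.dict-transcription'))]
    if not entries:  # assumption: a page past the end has no dictionary entries
        break
    all_entries.extend(entries)
    page += 1

print('collected {} entries'.format(len(all_entries)))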

Upvotes: 4

Reedinationer

Reputation: 5774

Andrej beat me to it, but alternatively you can use the .find() and .get_text() methods of BeautifulSoup:

import bs4
import requests

for page_number in range(1, 411):
    print("-" * 35, page_number, "-" * 35)
    resp = requests.get("https://www.pealim.com/dict/?page={}".format(page_number))
    soup = bs4.BeautifulSoup(resp.text, "html.parser")
    table_elem = soup.find("tbody")  # the dictionary table on the page
    rows = table_elem.find_all("tr")
    for row in rows:  # one dictionary entry per row
        hebrew = row.find("span", class_="menukad").get_text()
        latin = row.find("span", class_="dict-transcription").get_text()
        print("{}: {}".format(hebrew, latin))

This yields essentially the same result.
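Since the goal is to collect every entry rather than just print it, here is a minimal variation of the same approach that writes the pairs to a CSV file; the filename pealim_dict.csv is arbitrary, and the None checks are only a precaution in case some row lacks one of the spans:

import csv
import bs4
import requests

with open("pealim_dict.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["hebrew", "latin"])
    for page_number in range(1, 411):
        resp = requests.get("https://www.pealim.com/dict/?page={}".format(page_number))
        soup = bs4.BeautifulSoup(resp.text, "html.parser")
        for row in soup.find("tbody").find_all("tr"):
            hebrew = row.find("span", class_="menukad")
            latin = row.find("span", class_="dict-transcription")
            if hebrew is None or latin is None:  # defensive: skip rows missing either span
                continue
            writer.writerow([hebrew.get_text(), latin.get_text()])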

Upvotes: 1
