KengoTokukawa

Reputation: 55

Python: failed to get all the text in all the <span> tags using BeautifulSoup

I have searched Stack Overflow but still haven't found a solution. Here is the HTML I need to handle:

......<span ><span class='pl'>Director </span>: <span class='attrs'><a href="/celebrity/1022571/" rel="v:directedBy">James</a></span></span><br/>
<span ><span class='pl'>Actor</span>: <span class='attrs'><a href="/celebrity/1022571/">Tom</a></span></span><br/>
<span class="pl">Countries:</span> USA <br/>
<span class="pl">Language:</span> English <br/>......

There are many span tags in the file. Here is my code:

from bs4 import BeautifulSoup

record=[]
soup=BeautifulSoup(html, "html.parser")
spans=soup.find_all('span')
for span in spans:
    record.append(span.text)

With the code above, I ran into two problems. First, Director and Actor appear twice in the result, because each of them sits inside two nested span tags. Second, I can't get the text that comes before each <br> tag. I don't want to use the following code:

soup.find("span", text="Language:").next_sibling

because I would have to add that code to my project for every br tag, which is tedious. Is there a more elegant solution?

Upvotes: 1

Views: 595

Answers (1)

alecxe

Reputation: 473803

If you want to write something generic, you would still need to locate the next sibling tag/text node with next_sibling or find_next_sibling.
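
To illustrate the difference between the two, here is a minimal sketch on a trimmed copy of the question's markup (the variable names text_node and next_tag are only for illustration): next_sibling returns whatever node comes next, including plain text, while find_next_sibling skips text nodes and returns the next matching tag.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one line of the markup from the question.
html = '<span class="pl">Language:</span> English <br/>'
soup = BeautifulSoup(html, "html.parser")
label = soup.find("span", class_="pl")

# next_sibling returns the very next node, which here is a text node.
text_node = label.next_sibling

# find_next_sibling ignores text nodes and returns the next tag.
next_tag = label.find_next_sibling()

print(repr(text_node.strip()))
print(next_tag.name)
```

So for the lines where the value is bare text after the label, next_sibling is the one that reaches it.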

Here is code that handles both cases: when the label is followed by an element, and when it is followed by a text node:

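from bs4 import BeautifulSoup

# html holds the markup from the question
soup = BeautifulSoup(html, "html.parser")

for label in soup.find_all("span", class_="pl"):
    # Prefer a sibling <span class="attrs">; otherwise fall back to the text node after the label
    value = label.find_next_sibling("span", class_="attrs")
    value = label.next_sibling.strip() if not value else value.get_text(strip=True)

    label = label.get_text(strip=True).strip(":")
    print(label, value)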

Prints:

Director James
Actor Tom
Countries USA
Language English
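
If you would rather collect the values than print them, the same idea could be wrapped in a helper. This is only a sketch: extract_fields is a hypothetical name, not part of BeautifulSoup, and the empty-string fallback for a label with no usable sibling is an assumption about what you would want in that case.

```python
from bs4 import BeautifulSoup, NavigableString


def extract_fields(html):
    """Map each <span class="pl"> label to the value that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for label in soup.find_all("span", class_="pl"):
        key = label.get_text(strip=True).strip(":")
        value_tag = label.find_next_sibling("span", class_="attrs")
        if value_tag is not None:
            # The value lives in a sibling element.
            value = value_tag.get_text(strip=True)
        elif isinstance(label.next_sibling, NavigableString):
            # The value is the bare text right after the label.
            value = label.next_sibling.strip()
        else:
            # Assumed fallback: nothing usable follows the label.
            value = ""
        fields[key] = value
    return fields


# Sample markup copied from the question.
html = """<span ><span class='pl'>Director </span>: <span class='attrs'><a href="/celebrity/1022571/" rel="v:directedBy">James</a></span></span><br/>
<span ><span class='pl'>Actor</span>: <span class='attrs'><a href="/celebrity/1022571/">Tom</a></span></span><br/>
<span class="pl">Countries:</span> USA <br/>
<span class="pl">Language:</span> English <br/>"""

fields = extract_fields(html)
print(fields)
```

The isinstance check guards against a label that is the last node in its parent, where next_sibling would be None.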

Upvotes: 1
