Python: failed to get all the text in all the <span> tags using BeautifulSoup

Question

I have searched Stack Overflow but still haven't found a solution. Here is the HTML file I need to handle:

......Director : James

Actor: Tom

Countries: USA 

Language: English 
......

There are many span tags in the file. Here is my code:

from bs4 import BeautifulSoup

record=[]
soup=BeautifulSoup(html)
spans=soup.find_all('span')
for span in spans:
    record.append(span.text)

Using the code above, I ran into two problems. First, Director and Actor appear twice in the result because each of them is inside two span tags. Second, I can't get the text that comes before the br tag. I don't want to use the following code:

soup.find("span", text="Language:").next_sibling

because I would need to add that code for every br tag in my project, which is annoying. Is there a more elegant solution?

alecxe · Accepted Answer

If you want to write something generic, you would still need to locate the next sibling tag/text node with next_sibling or find_next_sibling.
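
To make the difference between the two concrete, here is a minimal sketch. The markup is an assumption on my part (the question does not show the original HTML in full): one label sits inside an outer span and is followed by a span with class "attrs", the other is followed by bare text.

```python
from bs4 import BeautifulSoup

# Assumed markup, for illustration only: the first label is followed by a
# value tag, the second by a plain text node.
html = (
    '<div id="info">'
    '<span><span class="pl">Director</span>: '
    '<span class="attrs">James</span></span><br/>'
    '<span class="pl">Language:</span> English<br/>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
director, language = soup.find_all("span", class_="pl")

# next_sibling returns the very next node, which may be a plain text node.
director_next = director.next_sibling
language_next = language.next_sibling

# find_next_sibling skips text nodes and returns the next matching tag,
# or None when no following sibling matches.
director_tag = director.find_next_sibling("span", class_="attrs")
language_tag = language.find_next_sibling("span", class_="attrs")

print(repr(director_next), repr(language_next))
print(director_tag.get_text(), language_tag)
```

So next_sibling is the right tool when the value is loose text, and find_next_sibling when the value is wrapped in its own tag.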

Here is code that handles both cases: when the label is followed by an element, and when it is followed by a text node:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

for label in soup.find_all("span", class_="pl"):
    # Prefer a sibling <span class="attrs">; otherwise use the text node after the label.
    value = label.find_next_sibling("span", class_="attrs")
    value = label.next_sibling.strip() if not value else value.get_text(strip=True)

    label = label.get_text(strip=True).strip(":")
    print(label, value)

Prints:

Director James
Actor Tom
Countries USA
Language English
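
If you would rather collect the values than print them, the same idea can be wrapped in a function. This is only a sketch: SAMPLE_HTML is markup I made up to match the output above, and extract_fields is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumed markup, reconstructed for illustration only.
SAMPLE_HTML = """
<div id="info">
<span><span class="pl">Director</span>: <span class="attrs">James</span></span><br/>
<span><span class="pl">Actor</span>: <span class="attrs">Tom</span></span><br/>
<span class="pl">Countries:</span> USA<br/>
<span class="pl">Language:</span> English<br/>
</div>
"""


def extract_fields(html):
    """Return {label: value} for every <span class="pl"> label (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for label in soup.find_all("span", class_="pl"):
        value_tag = label.find_next_sibling("span", class_="attrs")
        if value_tag is not None:
            # The value is wrapped in its own tag.
            value = value_tag.get_text(strip=True)
        else:
            # No value tag: fall back to the text node right after the label.
            sibling = label.next_sibling
            value = sibling.strip() if isinstance(sibling, NavigableString) else ""
        fields[label.get_text(strip=True).strip(":")] = value
    return fields


fields = extract_fields(SAMPLE_HTML)
print(fields)
```

Because it only iterates over the "pl" labels, each field shows up once, which also avoids the duplicates you get from calling find_all('span') on nested spans. Note that find_next_sibling looks at all following siblings, so this relies on text-valued labels not having an "attrs" span later at the same level.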
