Reputation: 55
I have searched Stack Overflow but still can't find a solution for this. Here is the HTML I need to handle:
......<span ><span class='pl'>Director </span>: <span class='attrs'><a href="/celebrity/1022571/" rel="v:directedBy">James</a></span></span><br/>
<span ><span class='pl'>Actor</span>: <span class='attrs'><a href="/celebrity/1022571/">Tom</a></span></span><br/>
<span class="pl">Countries:</span> USA <br/>
<span class="pl">Language:</span> English <br/>......
There are many span tags in the file.
Here is my code:
from bs4 import BeautifulSoup
record = []
soup = BeautifulSoup(html)
spans = soup.find_all('span')
for span in spans:
    record.append(span.text)
With the code above I ran into two problems. First, Director and Actor each appear twice in the result because their text sits inside two nested span tags. Second, I can't get the text that comes before a <br> tag. I don't want to use the following code:
soup.find("span", text="Language:").next_sibling
because I would have to add that code to my project for every br tag, which is annoying.
Is there a more elegant solution?
Upvotes: 1
Views: 595
Reputation: 473803
If you want to write something generic, you still need to locate the next sibling tag or text node with next_sibling or find_next_sibling.
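To make the difference between the two concrete, here is a minimal sketch on one line of the question's markup. The variable names snippet, text_node and next_tag are only illustrative.

```python
from bs4 import BeautifulSoup

# One line taken from the question's HTML.
snippet = '<span class="pl">Countries:</span> USA <br/>'
label = BeautifulSoup(snippet, "html.parser").find("span", class_="pl")

# next_sibling is the very next node, which here is the text node " USA ".
text_node = label.next_sibling

# find_next_sibling() skips text nodes and returns the next tag, here <br/>.
next_tag = label.find_next_sibling()

print(repr(text_node.strip()))
print(next_tag.name)
```

So a label followed by plain text needs next_sibling, while a label followed by another element can use find_next_sibling.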
Here is code that handles both cases: when the label is followed by an element, and when it is followed by a text node:
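from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for label in soup.find_all("span", class_="pl"):
    # Look for a sibling <span class="attrs"> first.
    value = label.find_next_sibling("span", class_="attrs")
    # If there is none, fall back to the text node right after the label.
    value = label.next_sibling.strip() if not value else value.get_text(strip=True)
    label = label.get_text(strip=True).strip(":")
    print(label, value)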
Prints:
Director James
Actor Tom
Countries USA
Language English
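If you would rather collect the pairs than print them, the same loop can be wrapped in a function that returns a dict. This is only a sketch: extract_fields is a hypothetical helper name, HTML is the sample from the question, and the isinstance check is an extra guard I added for labels that are not followed by a text node.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question.
HTML = """
<span ><span class='pl'>Director </span>: <span class='attrs'><a href="/celebrity/1022571/" rel="v:directedBy">James</a></span></span><br/>
<span ><span class='pl'>Actor</span>: <span class='attrs'><a href="/celebrity/1022571/">Tom</a></span></span><br/>
<span class="pl">Countries:</span> USA <br/>
<span class="pl">Language:</span> English <br/>
"""


def extract_fields(html):
    """Return {label: value} for every <span class="pl"> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for label in soup.find_all("span", class_="pl"):
        name = label.get_text(strip=True).strip(":")
        attrs = label.find_next_sibling("span", class_="attrs")
        if attrs is not None:
            # Case 1: the value is wrapped in <span class="attrs">.
            value = attrs.get_text(strip=True)
        else:
            # Case 2: the value is the bare text node right after the label.
            sibling = label.next_sibling
            value = sibling.strip() if isinstance(sibling, NavigableString) else ""
        fields[name] = value
    return fields


fields = extract_fields(HTML)
print(fields)
```

Note that find_next_sibling searches all following siblings, not just the adjacent one, so this relies on the plain-text labels having no span with class attrs later at the same level, which holds for the markup shown.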
Upvotes: 1