Reputation: 3812
Given the HTML code below, I want to output just the text of the h1 without the "Details about " prefix, which is the text of the span nested inside the h1.
My current output gives:
Details about New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
I would like:
New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
Here is the HTML I am working with:
<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>
Here is my current code:
for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    print(line.get_text())
Note: I do not want to just truncate the string because I would like this code to have some re-usability. What would be best is some code that crops out any text that is bounded by the span.
Upvotes: 6
Views: 1300
Reputation: 627344
You can use extract() to remove all span tags:
for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Detach every nested <span> from the tree before reading the text
    for s in line('span'):
        s.extract()
    print(line.get_text())
# => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
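Note that extract() modifies the parsed tree, so the spans are gone from soup afterwards. If the original tree needs to stay intact, one option is to strip the spans from a copy instead. The following is a minimal sketch, assuming a bs4 version that supports copy.copy() on tags; the helper name text_without_spans and the shortened HTML are only illustrative:

```python
import copy
from bs4 import BeautifulSoup

# Shortened, illustrative version of the markup from the question
html = '<h1 itemprop="name"><span class="g-hdn">Details about </span>Leather Wallet Black</h1>'
soup = BeautifulSoup(html, 'html.parser')

def text_without_spans(tag):
    """Return the tag's text with all nested <span> elements removed.

    Hypothetical helper: it works on a copy, so the original tree is untouched.
    """
    clone = copy.copy(tag)
    for span in clone.find_all('span'):
        span.extract()
    return clone.get_text()

h1 = soup.find('h1', attrs={'itemprop': 'name'})
title = text_without_spans(h1)
print(title)
```

After this runs, soup.find('span') still returns the original span, because only the copy was modified.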
Upvotes: 5
Reputation: 9909
One solution is to check whether each child of the h1 parses as HTML, and skip it if it does:
from bs4 import BeautifulSoup
html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')
for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Re-parse the child; find() returns a tag only if the child contains markup
        if bool(BeautifulSoup(str(content), "html.parser").find()):
            continue
        print(content)
Another solution (which I prefer) is to check whether each child is an instance of bs4.element.Tag:
import bs4
html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = bs4.BeautifulSoup(html, 'html.parser')
for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Skip child elements such as the <span>; keep only plain text nodes
        if isinstance(content, bs4.element.Tag):
            continue
        print(content)
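Since the question asks for something reusable, the same idea can be wrapped in a small function. This is only a sketch: the name own_text is made up, and skipping HTML comments is an extra assumption on my part (comments are text nodes in bs4, so they would otherwise end up in the result):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

def own_text(tag):
    """Return only the text that sits directly inside `tag`.

    Hypothetical helper: text inside child elements (e.g. a <span>)
    and HTML comments is ignored.
    """
    parts = [
        child for child in tag.contents
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    return ''.join(parts).strip()

# Shortened, illustrative version of the markup from the question
html = '<h1 itemprop="name"><span class="g-hdn">Details about </span>Leather Wallet Black</h1>'
soup = BeautifulSoup(html, 'html.parser')
titles = [own_text(h1) for h1 in soup.find_all('h1', attrs={'itemprop': 'name'})]
print(titles)
```

Unlike the extract() approach, this leaves the tree unchanged and does not depend on the unwanted text being in a span specifically; any nested element is skipped.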
Upvotes: 0