Rorschach

Reputation: 3812

beautifulsoup .get_text() is not specific enough for my HTML parsing

Given the HTML below, I want to output just the text of the h1, but not the "Details about  " part, which is the text of the span nested inside the h1.

My current output gives:

Details about   New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

I would like:

New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

Here is the HTML I am working with:

<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

Here is my current code:

for line in soup.find_all('h1',attrs={'itemprop':'name'}):
    print(line.get_text())

Note: I do not want to just truncate the string because I would like this code to have some re-usability. What would be best is some code that crops out any text that is bounded by the span.

Upvotes: 6

Views: 1300

Answers (2)

Wiktor Stribiżew

Reputation: 627344

You can use extract() to remove all span tags:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Remove every nested <span> from the tree (this modifies soup in place)
    for s in line('span'):
        s.extract()
    print(line.get_text())
# => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
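
Since the question asks for something reusable, the same extract() idea could be wrapped in a small function. This is only a sketch: text_without_tags and its names parameter are made-up for illustration and are not part of BeautifulSoup, and the function still modifies the parsed tree, just like the loop above.

```python
from bs4 import BeautifulSoup

HTML = ('<h1 class="it-ttl" itemprop="name" id="itemTitle">'
        '<span class="g-hdn">Details about  &nbsp;</span>'
        'New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>')


def text_without_tags(tag, names=("span",)):
    """Hypothetical helper: drop the named descendant tags, return the remaining text.

    Note: extract() mutates the tree that `tag` belongs to.
    """
    for child in tag.find_all(list(names)):
        child.extract()  # detach the unwanted tag (and its text) from the tree
    return tag.get_text().strip()


soup = BeautifulSoup(HTML, "html.parser")
titles = [text_without_tags(h1)
          for h1 in soup.find_all("h1", attrs={"itemprop": "name"})]
print(titles)
```

If the original tree is needed afterwards, parse the HTML again or work on a separate soup, because the extracted spans are gone from this one.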

Upvotes: 5

Dušan Maďar

Reputation: 9909

One solution is to check whether each child of the h1 parses as HTML and skip the ones that do:

from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Re-parse the child; find() returns a tag only if the child contains markup
        if BeautifulSoup(str(content), "html.parser").find():
            continue

        print(content)

Another solution (which I prefer) is to check whether each child is an instance of bs4.element.Tag and skip it if so:

import bs4

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = bs4.BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Tags (such as the span) are skipped; only plain text nodes are printed
        if isinstance(content, bs4.element.Tag):
            continue

        print(content)
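
The type check can also be turned around: keep the children that are text nodes instead of skipping the ones that are tags. A rough sketch, assuming you only want text that sits directly inside the h1; own_text is a hypothetical name, not a BeautifulSoup function. Unlike extract(), this leaves the tree untouched.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

HTML = ('<h1 class="it-ttl" itemprop="name" id="itemTitle">'
        '<span class="g-hdn">Details about  &nbsp;</span>'
        'New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>')


def own_text(tag):
    """Hypothetical helper: join only the text nodes that are direct children of `tag`.

    NavigableString subclasses (comments, for example) would be included too.
    """
    return "".join(child for child in tag.contents
                   if isinstance(child, NavigableString)).strip()


soup = BeautifulSoup(HTML, "html.parser")
h1 = soup.find("h1", attrs={"itemprop": "name"})
title = own_text(h1)
print(title)
```

Text inside any nested tag is ignored here, not only text inside a span, so this fits best when the wanted text is always a direct child of the heading.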

Upvotes: 0
