Reputation: 373

extracting data between span tags with BeautifulSoup Python

I would like to extract the data between span tags. Here is a sample of the HTML:

<p>
    <span class="html-italic">3-Acetyl-</span>
    <span class="html-italic">(4-acetyl-5-(β</span>
    "-"
    <span class="html-italic">naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one</span>
     "("
    <b>5b</b>
</p>

I need to get the full name:

3-Acetyl-4-acetyl-5-(β-naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one (without 5b). I don't know how to extract the '-' between the second and third span tags. Also, the total number of span tags may vary, and a '-' can appear between any two span tags. The code I wrote gives me only: 3-Acetyl-4-acetyl-5-(β. Here is part of my code:

p = soup.find("p")
name = ""
for child in p.children:
    if child.name == "span":
        name += child.text
print name

Any help is highly appreciated!

Upvotes: 2

Answers (5)

zevij

Reputation: 2446

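If you like one-liners, you can do something like the following (note that find("span") returns only the first matching span, so this gets just the first fragment):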

(your_item.find("p", {"attr": "value"})).find("span").get_text()

Upvotes: 1

Dan Rice

Reputation: 630

You can use BeautifulSoup's .findAll(text=True) to get all the text inside the element, including the text outside the spans. This returns a list of text fragments, which need to be stripped of whitespace and quotation marks. I'm not sure what rule you're using to exclude the trailing "(" and 5b, but maybe it's as easy as slicing the list:

import string

parts = soup.find("p").findAll(text=True)
name = ''.join(p.strip(string.whitespace + '"') for p in parts[:-3])

Result:

u'3-Acetyl-(4-acetyl-5-(β-naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one'

Upvotes: 2

conrad

Reputation: 1913

You can just do something like this:

p = soup.find("p")
name = ""
for child in p.children:
    if child.name == "span":
        name += child.text
    elif child.name is None:  # bare text node between tags (name is None, not the string 'None')
        name += child.string.strip("\"\n ")
print(name)
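
A loop over the children like this still needs some rule for where the name ends, since the question wants the "(" and 5b left out. Here is a minimal, self-contained sketch that assumes the name ends at the first <b> tag and that the quotation marks in the sample are literally part of the text. The helper name full_name and that stopping rule are my assumptions, not something the question specifies:

```python
from bs4 import BeautifulSoup

HTML = '''<p>
    <span class="html-italic">3-Acetyl-</span>
    <span class="html-italic">(4-acetyl-5-(β</span>
    "-"
    <span class="html-italic">naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one</span>
     "("
    <b>5b</b>
</p>'''


def full_name(p):
    """Join the text of p's children, stopping at the first <b> tag."""
    pieces = []
    for child in p.children:
        if child.name == "b":  # assumed rule: the compound label ends the name
            break
        # Tags expose get_text(); bare text nodes are already strings.
        text = child.get_text() if child.name else str(child)
        # Strip surrounding whitespace and the literal quotation marks.
        pieces.append(text.strip(' \n\t"'))
    # Drop the "(" that opens the label, i.e. the one just before <b>5b</b>.
    return "".join(pieces).rstrip("(")


soup = BeautifulSoup(HTML, "html.parser")
name = full_name(soup.find("p"))
print(name)
```

If your real pages mark the end of the name differently (for example with another tag or a fixed separator), only the break condition and the final rstrip should need to change.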

Upvotes: 2

Avinash Raj

Reputation: 174834

You could use CSS selectors.

>>> ''.join(i.text for i in soup.select('p > span'))
'3-Acetyl-(4-acetyl-5-(βnaphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one'
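
Note that the result above has no '-' before naphtyl: the hyphen sits in a bare text node between two spans, and a span-only selector never visits it. A small sketch of the difference, using a shortened, made-up fragment of the sample markup (the variable names are just for illustration):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample: a quoted "-" text node between two spans.
html = '<p><span>(β</span> "-" <span>naphtyl)</span></p>'
soup = BeautifulSoup(html, "html.parser")

# Only the <span> elements: the "-" text node is skipped.
spans_only = "".join(s.get_text() for s in soup.select("p > span"))

# Every text node under <p>, with spaces and literal quotes stripped.
all_text = "".join(t.strip(' "') for t in soup.p.find_all(string=True))

print(spans_only)
print(all_text)
```

So the selector approach is fine when all the text you want is inside spans, but for this markup you also have to collect the text nodes between them.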

Upvotes: 3

Hackaholic

Reputation: 19763

Try it like this:

name = ""
for x in soup.find('p'):
    # Text nodes have name None, so only the <span> tags match here
    if x.name == 'span':
        name += x.get_text()
print(name)

output:

3-Acetyl-(4-acetyl-5-(βnaphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one

Upvotes: 1
