chemist

Reputation: 373

Extracting data between span tags with BeautifulSoup in Python

I would like to extract the data between span tags. Here is a sample of the HTML:

<p>
    <span class="html-italic">3-Acetyl-</span>
    <span class="html-italic">(4-acetyl-5-(β</span>
    "-"
    <span class="html-italic">naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one</span>
     "("
    <b>5b</b>
</p>

I need to get the full name:

3-Acetyl-4-acetyl-5-(β-naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one (without 5b). I don't know how to extract the '-' between the second and third span tags. Also, the total number of span tags may vary, and a '-' can appear between any two of them. The code I wrote gives me only: 3-Acetyl-4-acetyl-5-(β. Here is the relevant part of my code:

p = soup.find("p")
name = ""
for child in p.children:
    if child.name == "span":
        name += child.text
print(name)

Any help is highly appreciated!

Upvotes: 2

Views: 9728

Answers (5)

zevij

Reputation: 2446

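If you like one-liners, you can do something like the following, where {"attr": "value"} stands for whatever attributes identify your p tag. Note that find("span") returns only the first span, so this gives just the first fragment of the name:

your_item.find("p", {"attr": "value"}).find("span").get_text()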

Upvotes: 1

Dan Rice

Reputation: 630

You can use BeautifulSoup's .findAll(text=True) (spelled find_all in current versions) to get all the text inside the element, including the text outside the spans. This returns a list of text parts, which need to be stripped of whitespace and quotation marks. I'm not sure what rule you're using to exclude the trailing "(" and 5b, but maybe it's as easy as slicing the list:

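import string

# Every text node under <p>, in document order; the last three are '"("', '5b' and a newline
parts = soup.find("p").findAll(text=True)
name = ''.join(p.strip(string.whitespace + '"') for p in parts[:-3])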

Result:

u'3-Acetyl-(4-acetyl-5-(β-naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one'
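If slicing by position feels fragile, one alternative is to filter the text nodes by where they sit in the tree. The following is only a sketch under two assumptions that are not stated in the question: the label to drop always lives inside a b tag, and a trailing "(" exists only to open that label. The function name name_without_label is hypothetical.

```python
from bs4 import BeautifulSoup

HTML = """<p>
    <span class="html-italic">3-Acetyl-</span>
    <span class="html-italic">(4-acetyl-5-(β</span>
    "-"
    <span class="html-italic">naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one</span>
     "("
    <b>5b</b>
</p>"""


def name_without_label(p):
    """Join all text under <p>, skipping anything nested inside a <b> tag."""
    # find_all(string=True) returns every text node, inside or outside the spans.
    pieces = [s for s in p.find_all(string=True) if s.find_parent("b") is None]
    # Strip surrounding whitespace and the literal quotation marks.
    name = "".join(s.strip(' \t\n"') for s in pieces)
    # Assumption: a trailing "(" only introduces the dropped label.
    return name[:-1] if name.endswith("(") else name


soup = BeautifulSoup(HTML, "html.parser")
print(name_without_label(soup.find("p")))
```

This keeps working when the number of spans changes, because it does not depend on how many text parts come before the label.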

Upvotes: 2

conrad

Reputation: 1913

You can do something like this:

p = soup.find("p")
name = ""
for child in p.children:
    if child.name == "span":
        name += child.text
    elif child.name is None:
        # Bare text nodes have no tag name; strip whitespace and quotes.
        # Note: this also picks up the "(" that precedes <b>5b</b>.
        name += child.string.strip("\"\n ")
print(name)
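To leave out the trailing "(" and the 5b label, the same loop can stop as soon as it reaches the b tag. This is a minimal sketch, assuming the label is always the first b child of the paragraph and that a trailing "(" belongs to the label; name_before_label is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

HTML = """<p>
    <span class="html-italic">3-Acetyl-</span>
    <span class="html-italic">(4-acetyl-5-(β</span>
    "-"
    <span class="html-italic">naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one</span>
     "("
    <b>5b</b>
</p>"""


def name_before_label(p):
    """Concatenate span text and bare text nodes up to the first <b> child."""
    parts = []
    for child in p.children:
        if child.name == "b":
            break  # assumption: everything from <b> onward is the label
        # Tags expose get_text(); bare text nodes have name None.
        text = child.get_text() if child.name else str(child)
        parts.append(text.strip(' \t\n"'))
    name = "".join(parts)
    # Assumption: a trailing "(" only opens the label that was skipped.
    return name[:-1] if name.endswith("(") else name


soup = BeautifulSoup(HTML, "html.parser")
print(name_before_label(soup.find("p")))
```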

Upvotes: 2

Avinash Raj

Reputation: 174696

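You could use CSS selectors. Note that this joins only the text of the span tags, so the '-' that sits outside the spans is not included: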

>>> ''.join(i.text for i in soup.select('p > span'))
'3-Acetyl-(4-acetyl-5-(βnaphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one'

Upvotes: 3

Hackaholic

Reputation: 19733

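Try this:

name = ""
# Iterating over a tag yields its direct children; text nodes have name None,
# so no try/except is needed. Only span text is collected.
for x in soup.find('p'):
    if x.name == 'span':
        name += x.get_text()
print(name)

Output: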

3-Acetyl-(4-acetyl-5-(βnaphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one

Upvotes: 1
