Reputation: 373
I would like to extract the text between span tags. Here is a sample of the HTML:
<p>
<span class="html-italic">3-Acetyl-</span>
<span class="html-italic">(4-acetyl-5-(β</span>
"-"
<span class="html-italic">naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one</span>
"("
<b>5b</b>
</p>
I need to get a full name:
3-Acetyl-4-acetyl-5-(β-naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one
(without 5b). I don't know how to extract the '-' between the second and third span tags. Also, the total number of span tags may vary, and the '-' can appear between any two span tags. The code I wrote gives me only: 3-Acetyl-4-acetyl-5-(β. Here is part of my code:
p = soup.find("p")
name = ""
for child in p.children:
    if child.name == "span":
        name += child.text
print(name)
Any help is highly appreciated!
Upvotes: 2
Views: 9728
Reputation: 2446
If you like one-liners, you can do something like this (note that find returns only the first matching span):
your_item.find("p", {"attr": "value"}).find("span").get_text()
Upvotes: 1
Reputation: 630
You can use BeautifulSoup's .findAll(text=True) to get all text inside the element, including the text outside the spans. This returns a list of text parts, which need to be stripped of whitespace and quotation marks. I'm not sure what rule you're using to exclude the trailing "(" and 5b, but maybe it's as easy as slicing the list:
import string
parts = soup.find("p").findAll(text=True)
name = ''.join(p.strip(string.whitespace + '"') for p in parts[:-3])
Result:
u'3-Acetyl-(4-acetyl-5-(β-naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one'
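To see where the [:-3] slice comes from, it may help to print the raw text nodes first. The following is only a sketch: it rebuilds the question's sample as a string (the HTML constant is my own name for it), uses the stdlib "html.parser" backend, and spells the filter as find_all(string=True), which is the newer name for findAll(text=True).

```python
import string

from bs4 import BeautifulSoup

# The sample markup from the question, reproduced so the snippet runs on its own.
HTML = """<p>
<span class="html-italic">3-Acetyl-</span>
<span class="html-italic">(4-acetyl-5-(β</span>
"-"
<span class="html-italic">naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one</span>
"("
<b>5b</b>
</p>"""

soup = BeautifulSoup(HTML, "html.parser")

# string=True matches every text node under <p>, inside or outside the spans.
parts = soup.find("p").find_all(string=True)

# Show each node with repr() so newlines and literal quotes are visible.
for index, part in enumerate(parts):
    print(index, repr(str(part)))

# The last three nodes are the quoted "(", the text of <b>, and a final newline,
# which is why the slice drops exactly three items for this particular sample.
name = "".join(part.strip(string.whitespace + '"') for part in parts[:-3])
print(name)
```

The slice is tied to this exact layout, so it would need adjusting if the paragraph ends differently.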
Upvotes: 2
Reputation: 1913
You can just do something like this:
p = soup.find("p")
name = ""
for child in p.children:
    if child.name == "span":
        name += child.text
    elif child.name is None:  # bare text node between the tags
        name += child.string.strip("\"\n ")
print(name)
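A loop over the children like this still picks up the "(" that sits just before the bold label. One possible way around that, assuming the rule is "everything up to the first <b>, minus a dangling opening parenthesis" (my guess, the question does not state the rule), is sketched below. The function name compound_name and the HTML constant are made up for illustration.

```python
import string

from bs4 import BeautifulSoup
from bs4.element import Tag

# The sample markup from the question, reproduced so the snippet runs on its own.
HTML = """<p>
<span class="html-italic">3-Acetyl-</span>
<span class="html-italic">(4-acetyl-5-(β</span>
"-"
<span class="html-italic">naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one</span>
"("
<b>5b</b>
</p>"""


def compound_name(paragraph):
    """Join the text of a <p>, stopping at the first <b> (assumed rule)."""
    pieces = []
    for child in paragraph.children:
        if isinstance(child, Tag):
            if child.name == "b":
                break  # assumption: the bold label marks the end of the name
            pieces.append(child.get_text())
        else:
            # Bare text node: drop surrounding whitespace and literal quotes.
            pieces.append(child.strip(string.whitespace + '"'))
    # Assumption: a "(" left dangling before the label is not part of the name.
    return "".join(pieces).rstrip("(")


soup = BeautifulSoup(HTML, "html.parser")
print(compound_name(soup.find("p")))
```

This works for any number of spans, and the '-' can sit between any of them, since every direct child is visited in document order.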
Upvotes: 2
Reputation: 174696
You could use CSS selectors.
>>> ''.join(i.text for i in soup.select('p > span'))
'3-Acetyl-(4-acetyl-5-(βnaphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one'
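Note that the hyphen after β is missing from that result. A small sketch of why, using a reduced, made-up sample rather than the full markup: 'p > span' matches only the span elements, so text that sits directly inside the p is never visited.

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical sample: a quoted hyphen sits between two spans.
soup = BeautifulSoup(
    '<p><span>(β</span>"-"<span>naphtyl)</span></p>', "html.parser"
)

# 'p > span' selects the <span> elements only; the bare text node is skipped.
spans_only = "".join(span.get_text() for span in soup.select("p > span"))

# get_text() on the <p> itself walks every descendant text node.
all_text = soup.p.get_text().replace('"', "")

print(spans_only)  # (βnaphtyl)
print(all_text)    # (β-naphtyl)
```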
Upvotes: 3
Reputation: 19733
Try it like this:
name = ""
for x in soup.find('p'):
    try:
        if x.name == 'span':
            name += x.get_text()
    except AttributeError:
        pass
print(name)
Output:
3-Acetyl-(4-acetyl-5-(βnaphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one
Upvotes: 1