Reputation: 23
I have some HTML like this:
<p><span class="map-sub-title">abc</span>123</p>
I'm using BeautifulSoup, and here's my code:
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html, "lxml")
p = soup1.text
I get the result 'abc123', but I want '123', not 'abc123'.
Upvotes: 2
Views: 587
Reputation: 22887
If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:
>>> from bs4 import BeautifulSoup
>>> html = '<p><span class="map-sub-title">abc</span>123</p>'
>>> soup1 = BeautifulSoup(html,"lxml")
>>> soup1.p.strings
<generator object _all_strings at 0x00000008768C50>
>>> list(soup1.p.strings)
['abc', '123']
>>> list(soup1.p.strings)[1]
'123'
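Indexing with [1] only works when you already know where the string sits. Here is a minimal sketch of a position-independent variant, assuming you want only the strings whose immediate parent is the <p> tag. It uses "html.parser" instead of "lxml" so the snippet has no extra dependency; that swap is an assumption of this sketch, not part of the answer above.

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# Keep only the strings whose direct parent is the <p> tag itself,
# skipping strings nested inside child tags such as <span>.
direct_strings = [s for s in p.strings if s.parent is p]
print(direct_strings)
```

Because .strings walks every descendant, the parent check is what limits the result to the text that belongs to <p> directly.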
Upvotes: 0
Reputation: 7238
One of the many ways is to use contents on the parent tag (in this case, <p>).
If you know the position of the string, you can directly use this:
>>> from bs4 import BeautifulSoup, NavigableString
>>> soup = BeautifulSoup('<p><span class="map-sub-title">abc</span>123</p>', 'lxml')
>>> # check the contents
... soup.find('p').contents
[<span class="map-sub-title">abc</span>, '123']
>>> soup.find('p').contents[1]
'123'
If you want a generalized solution where you don't know the position, you can check whether each item in contents is a NavigableString, like this:
>>> final_text = [x for x in soup.find('p').contents if isinstance(x, NavigableString)]
>>> final_text
['123']
With the second method, you get all the text that is a direct child of the <p> tag. For completeness' sake, here's one more example:
>>> html = '''
... <p>
... I want
... <span class="map-sub-title">abc</span>
... foo
... <span class="map-sub-title">abc2</span>
... text
... <span class="map-sub-title">abc3</span>
... only
... </p>
... '''
>>> soup = BeautifulSoup(html, 'lxml')
>>> ' '.join([x.strip() for x in soup.find('p').contents if isinstance(x, NavigableString)])
'I want foo text only'
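One caveat with the isinstance check: in bs4, Comment is a subclass of NavigableString, so HTML comments directly inside the <p> tag would pass the filter too. A small sketch that excludes them, assuming a made-up input with a comment added purely for illustration (and "html.parser" in place of "lxml" to avoid the extra dependency):

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical input: same markup as the question, plus an HTML comment.
html = '<p><!-- note --><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

# Keep direct-child strings, but drop Comment nodes,
# which are also NavigableString instances.
direct_text = [
    x for x in soup.find("p").contents
    if isinstance(x, NavigableString) and not isinstance(x, Comment)
]
print(direct_text)
```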
Upvotes: 1
Reputation: 8392
Although every answer in this thread seems acceptable, I'll point out another method for this case:
soup.find("span", {'class':'map-sub-title'}).next_sibling
You can use next_sibling to navigate between elements that share the same parent, in this case the p tag.
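Note that find() returns None when nothing matches, and next_sibling can be another tag (or None) rather than a string. A hedged sketch of a guarded version follows; the helper name text_after_span is hypothetical, and "html.parser" is used instead of "lxml" only to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup, NavigableString

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")


def text_after_span(soup, css_class):
    """Return the string right after the first matching <span>, or None."""
    span = soup.find("span", class_=css_class)
    if span is None:
        # No such span in the document.
        return None
    sibling = span.next_sibling
    # Only accept a text node; a following tag or no sibling yields None.
    if isinstance(sibling, NavigableString):
        return str(sibling)
    return None


result = text_after_span(soup, "map-sub-title")
missing = text_after_span(soup, "no-such-class")
print(result, missing)
```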
Upvotes: 1
Reputation: 1174
You can also use extract() to remove the unwanted tag before getting the text from the parent tag, as shown below.
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
soup1.p.span.extract()
print(soup1.text)
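Keep in mind that extract() modifies the tree in place and returns the tag it removed, so the removed content is still available if you need it. A minimal sketch, assuming "html.parser" in place of "lxml" so it runs without the extra dependency:

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the <span> from the tree and hands it back.
removed = soup.p.span.extract()

remaining = soup.p.get_text()    # text left in <p> after the removal
removed_text = removed.get_text()  # text of the detached <span>
print(remaining, removed_text)
```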
Upvotes: 1
Reputation: 7242
You can use the decompose() function to remove the span tag and then get the text you want.
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "lxml")
for span in soup.find_all("span", {'class':'map-sub-title'}):
    span.decompose()
print(soup.text)
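Since decompose() destroys the tag for good, the original soup no longer contains the span afterwards. If you still need the untouched document, one option is to work on a copy of the <p> tag. This is only a sketch, assuming copy.copy() on a Tag (which bs4 supports and which produces a detached copy including its children) and "html.parser" instead of "lxml":

```python
import copy

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

# Copy the <p> tag so the original tree is left untouched.
p_copy = copy.copy(soup.p)
for span in p_copy.find_all("span", class_="map-sub-title"):
    span.decompose()

cleaned = p_copy.get_text()    # text of the copy, spans removed
original = soup.p.get_text()   # original tree still has the span text
print(cleaned, original)
```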
Upvotes: 1