Beautifulsoup get content without next tag

I have some HTML like this:

<p><span class="map-sub-title">abc</span>123</p>

I used BeautifulSoup, and here's my code:

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html, "lxml")
p = soup1.text

I get the result 'abc123'.

But I want to get '123', not 'abc123'.

Upvotes: 2

Answers (5)

BioGeek

Reputation: 22887

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

>>> from bs4 import BeautifulSoup
>>> html = '<p><span class="map-sub-title">abc</span>123</p>'
>>> soup1 = BeautifulSoup(html,"lxml")
>>> soup1.p.strings
<generator object _all_strings at 0x00000008768C50>
>>> list(soup1.p.strings)
['abc', '123']
>>> list(soup1.p.strings)[1]
'123'

Upvotes: 0

Keyur Potdar

Reputation: 7238

One of the many ways is to use .contents on the parent tag (in this case, <p>).

If you know the position of the string, you can directly use this:

>>> from bs4 import BeautifulSoup, NavigableString
>>> soup = BeautifulSoup('<p><span class="map-sub-title">abc</span>123</p>', 'lxml')
>>> # check the contents
... soup.find('p').contents
[<span class="map-sub-title">abc</span>, '123']
>>> soup.find('p').contents[1]
'123'

If you want a generalized solution where you don't know the position, you can check whether each item in .contents is a NavigableString, like this:

>>> final_text = [x for x in soup.find('p').contents if isinstance(x, NavigableString)]
>>> final_text
['123']

With the second method, you'll be able to get all the text that is directly a child of the <p> tag. For completeness's sake, here's one more example:

>>> html = '''
... <p>
...     I want
...     <span class="map-sub-title">abc</span>
...     foo
...     <span class="map-sub-title">abc2</span>
...     text
...     <span class="map-sub-title">abc3</span>
...     only
... </p>
... '''
>>> soup = BeautifulSoup(html, 'lxml')
>>> ' '.join([x.strip() for x in soup.find('p').contents if isinstance(x, NavigableString)])
'I want foo text only'
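
The same filtering can be wrapped in a small helper. This is only a minimal sketch: direct_text is a hypothetical name, and it uses find_all(string=True, recursive=False) rather than the isinstance check above, which should select the same direct text nodes. The "html.parser" backend is an assumption made so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Join the text nodes that are direct children of `tag` (hypothetical helper)."""
    # string=True matches text nodes only; recursive=False keeps the search
    # at the top level of `tag`, so text inside nested tags is skipped.
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(s.strip() for s in pieces if s.strip())


html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser instead of lxml
print(direct_text(soup.p))  # 123
```

Note that comments inside the tag are also string nodes in Beautiful Soup, so they would be included unless filtered out separately.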

Upvotes: 1

Zroq

Reputation: 8392

Although every response on this thread seems acceptable, I'll point out another method for this case:

soup.find("span", {'class':'map-sub-title'}).next_sibling

You can use next_sibling to navigate between nodes that share the same parent, in this case the p tag. Here the node right after the span is the text '123'.
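
For reference, a self-contained version of that one-liner might look like the sketch below. The "html.parser" backend and the variable names span and trailing are assumptions for illustration. The guard is there because find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser instead of lxml

span = soup.find("span", {"class": "map-sub-title"})
# find() returns None when there is no match, so guard before navigating.
trailing = span.next_sibling if span is not None else None
print(trailing)  # 123
```

next_sibling returns whatever node comes next, so if the span were followed by another tag (or by nothing), the result would be a Tag or None rather than text.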

Upvotes: 1

PyMaster

Reputation: 1174

You can also use extract() to remove the unwanted tag before getting the text, as shown below.

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
soup1.p.span.extract()

print(soup1.text)

Upvotes: 1

Rafael

Reputation: 7242

You can use the function decompose() to remove the span tag and then get the text you want.

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "lxml")

for span in soup.find_all("span", {'class':'map-sub-title'}):
    span.decompose()

print(soup.text)
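
Keep in mind that decompose() modifies the parsed tree in place. If the original tree is still needed afterwards, one option is to work on a copy of the tag. The following is a sketch under the assumption that the installed Beautiful Soup 4 release supports copy.copy() on Tag objects (documented for recent versions); p_copy is a hypothetical name, and "html.parser" is used so the snippet runs without lxml.

```python
import copy

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser instead of lxml

# Work on a detached copy so the original tree keeps its <span>.
p_copy = copy.copy(soup.p)
for span in p_copy.find_all("span", {"class": "map-sub-title"}):
    span.decompose()

print(p_copy.get_text())  # 123
print(soup.p.get_text())  # abc123
```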

Upvotes: 1
