Problems extracting text from multiple tags ignoring subtags

Question

I have this sample HTML:

soup=BeautifulSoup('''
<ul>
 <li class="item">
  <span> A. </span>
Text I want
 </li>
 <li class="item">
  <span> B. </span>
Second text I want
 </li>
</ul>''')

I'm trying to extract "Text I want" and "Second text I want", ignoring the span tags. So far I have tried:

soup.li.find_all(text=True,recursive=False)

Which returns [' ', ' Text I want '].

If I try:

for s in soup.ul:
    print(s.find(text=True,recursive=False))

I get an error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
      1 for s in soup.ul:
----> 2     print(s.find(text=True,recursive=False))

TypeError: find() takes no keyword arguments

Any help is appreciated.

Andrej Kesely · Accepted Answer

You can use a list comprehension to extract the texts:

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    """
<ul>
 <li class="item">
  <span> A. </span>
Text I want
 </li>
 <li class="item">
  <span> B. </span>
Second text I want
 </li>
</ul>""",
    "html.parser",
)

texts = [
    txt
    for li in soup.select("li.item")
    for t in li.find_all(text=True, recursive=False)
    if (txt := t.strip())
]
print(texts)

Prints:

['Text I want', 'Second text I want']
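As for the TypeError in the question: iterating over soup.ul yields every child node, including the whitespace between the li tags. Those children are NavigableString objects, which subclass str, so .find() on them is most likely resolving to str.find(), which accepts no keyword arguments. A minimal sketch of the same loop that skips non-tag children is below. The markup is an assumption reconstructed from the selectors used here, and it uses string=True, the newer name for the text=True argument:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed markup: only ul, li.item and span are known from the selectors.
html = """
<ul>
<li class="item"><span>A.</span>
Text I want</li>
<li class="item"><span>B.</span>
Second text I want</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

texts = []
for child in soup.ul:
    # Whitespace between the <li> tags arrives as NavigableString children;
    # they are str subclasses, so skip anything that is not a Tag.
    if not isinstance(child, Tag):
        continue
    # Collect the direct text nodes of the <li>, ignoring whitespace-only ones.
    for node in child.find_all(string=True, recursive=False):
        stripped = node.strip()
        if stripped:
            texts.append(stripped)

print(texts)
```

With the assumed markup this should give the same two strings as the list comprehension above.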

Or remove the <span> tags first and then get the text:

for span in soup.select("span"):
    span.extract()

texts = [li.get_text(strip=True) for li in soup.select("li.item")]
print(texts)

Prints:

['Text I want', 'Second text I want']
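Note that .extract() changes the parsed tree, so the spans are gone from soup afterwards. If the spans are still needed later, one option is to work on a copy of each li. The sketch below assumes the same reconstructed markup as a stand-in for the real page, and own_text is a hypothetical helper name, not part of Beautiful Soup:

```python
import copy

from bs4 import BeautifulSoup

# Assumed markup: only ul, li.item and span are known from the selectors.
html = """
<ul>
<li class="item"><span>A.</span>
Text I want</li>
<li class="item"><span>B.</span>
Second text I want</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def own_text(tag):
    """Return the tag's text without its <span> descendants, leaving the tree intact."""
    clone = copy.copy(tag)  # detached copy of the tag and its children
    for span in clone.select("span"):
        span.decompose()  # removes the span from the copy only
    return clone.get_text(strip=True)


texts = [own_text(li) for li in soup.select("li.item")]
remaining_spans = len(soup.select("span"))

print(texts)
print(remaining_spans)  # the original soup still contains both spans
```

The copy costs a little extra work per li, which is usually negligible for small documents.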

Or find each <span> and then use .find_next_sibling(text=True):

texts = [
    span.find_next_sibling(text=True).strip()
    for span in soup.select("li.item span")
]
print(texts)

Prints:

['Text I want', 'Second text I want']
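One caveat with the sibling approach: .find_next_sibling() returns None when a span has no text after it, and calling .strip() on None raises an AttributeError. A hedged sketch with a guard is below. The third li is a made-up case added to show the guard, the rest of the markup is the same assumption as before, and string=True is the newer name for text=True:

```python
from bs4 import BeautifulSoup

# Assumed markup; the third <li> is hypothetical and has no text after its span.
html = """
<ul>
<li class="item"><span>A.</span>
Text I want</li>
<li class="item"><span>B.</span>
Second text I want</li>
<li class="item">Leading text<span>C.</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

texts = []
for span in soup.select("li.item span"):
    sibling = span.find_next_sibling(string=True)
    # Skip spans with no following text node, and whitespace-only nodes.
    if sibling is not None and sibling.strip():
        texts.append(sibling.strip())

print(texts)
```

Here the third item contributes nothing, because its only text comes before the span rather than after it.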
