Lesmana
Lesmana

Reputation: 27073

BeautifulSoup .children or .content without whitespace between tags

I want all children of a tag without the whitespace between the tags. But BeautifulSoups .contents and .children also returns the whitespace between the tags.

from bs4 import BeautifulSoup
html = """
<div id="list">
  <span>1</span>
  <a href="2.html">2</a>
  <a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').contents)

This prints:

['\n', <span>1</span>, '\n', <a href="2.html">2</a>, '\n', <a href="3.html">3</a>, '\n']

Same with

print(list(soup.find(id='list').children))

What I want:

[<span>1</span>, <a href="2.html">2</a>, <a href="3.html">3</a>]

Is there any way to tell BeautifulSoup to return only the tags and ignore the whitespace?

The documentation is not very helpful on this topic. The html in the example does not contain any whitespace between tags.

Indeed stripping the html of all whitespace between tags solves my problem:

html = """<div id="list"><span>1</span><a href="2.html">2</a><a href="3.html">3</a></div>"""

Using this html I get the tags without whitespace between the tags because there is no whitespace between the tags. But I hoped to use BeautifoulSoup so I do not have to mess around in the html source code. I was hoping BeautifulSoup does that for me.

Another workaround might be:

print(list(filter(lambda t: t != '\n', soup.find(id='list').contents)))

But that seems flaky. Is the whitespace guaranteed to be always exactly '\n'?


A note to the duplicate marking brigade:

There are many questions asking about BeautifulSoup and whitespace. Most are asking about getting rid of whitespace from the "rendered text".

For example:

BeautifulSoup - getting rid of paragraph whitespace/line breaks

Removing new line '\n' from the output of python BeautifulSoup

Both questions want the text without whitespace. I want the tags without whitespace. The solutions there don't apply to my question.

Another example:

Regular expression for class with whitespaces using Beautifulsoup

This question is about whitespace in the class attribute.

Upvotes: 4

Views: 1556

Answers (1)

Lesmana
Lesmana

Reputation: 27073

BeautifulSoup has .find_all(True) which returns all tags without the whitespace between the tags:

from bs4 import BeautifulSoup
html = """
<div id="list">
  <span>1</span>
  <a href="2.html">2</a>
  <a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').find_all(True))

Prints:

[<span>1</span>, <a href="2.html">2</a>, <a href="3.html">3</a>]

Combine with recursive=False, and you get only the direct children and not children of children.

to demonstrate I added <b> to the second child. which would be a grandchild.

from bs4 import BeautifulSoup
html = """
<div id="list">
  <span>1</span>
  <a href="2.html"><b>2</b></a>
  <a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').find_all(True, recursive=False))

with recursive=False it prints:

[<span>1</span>, <a href="2.html"><b>2</b></a>, <a href="3.html">3</a>]

with recursive=True it prints:

[<span>1</span>, <a href="2.html"><b>2</b></a>, <b>2</b>, <a href="3.html">3</a>]

Trivia: now that I have the solution I found another seemingly unrelated question and answer in StackOverflow where the solution was hidden in a comment:

Why does BeautifulSoup .children contain nameless elements as well as the expected tag(s)

Upvotes: 4

Related Questions