Reputation: 27073
I want all children of a tag without the whitespace between the tags. But BeautifulSoups .contents
and .children
also returns the whitespace between the tags.
from bs4 import BeautifulSoup
html = """
<div id="list">
<span>1</span>
<a href="2.html">2</a>
<a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').contents)
This prints:
['\n', <span>1</span>, '\n', <a href="2.html">2</a>, '\n', <a href="3.html">3</a>, '\n']
Same with
print(list(soup.find(id='list').children))
What I want:
[<span>1</span>, <a href="2.html">2</a>, <a href="3.html">3</a>]
Is there any way to tell BeautifulSoup to return only the tags and ignore the whitespace?
The documentation is not very helpful on this topic. The html in the example does not contain any whitespace between tags.
Indeed stripping the html of all whitespace between tags solves my problem:
html = """<div id="list"><span>1</span><a href="2.html">2</a><a href="3.html">3</a></div>"""
Using this html I get the tags without whitespace between the tags because there is no whitespace between the tags. But I hoped to use BeautifoulSoup so I do not have to mess around in the html source code. I was hoping BeautifulSoup does that for me.
Another workaround might be:
print(list(filter(lambda t: t != '\n', soup.find(id='list').contents)))
But that seems flaky. Is the whitespace guaranteed to be always exactly '\n'
?
A note to the duplicate marking brigade:
There are many questions asking about BeautifulSoup and whitespace. Most are asking about getting rid of whitespace from the "rendered text".
For example:
BeautifulSoup - getting rid of paragraph whitespace/line breaks
Removing new line '\n' from the output of python BeautifulSoup
Both questions want the text without whitespace. I want the tags without whitespace. The solutions there don't apply to my question.
Another example:
Regular expression for class with whitespaces using Beautifulsoup
This question is about whitespace in the class attribute.
Upvotes: 4
Views: 1556
Reputation: 27073
BeautifulSoup has .find_all(True)
which returns all tags without the whitespace between the tags:
from bs4 import BeautifulSoup
html = """
<div id="list">
<span>1</span>
<a href="2.html">2</a>
<a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').find_all(True))
Prints:
[<span>1</span>, <a href="2.html">2</a>, <a href="3.html">3</a>]
Combine with recursive=False
, and you get only the direct children and not children of children.
to demonstrate I added <b>
to the second child. which would be a grandchild.
from bs4 import BeautifulSoup
html = """
<div id="list">
<span>1</span>
<a href="2.html"><b>2</b></a>
<a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').find_all(True, recursive=False))
with recursive=False
it prints:
[<span>1</span>, <a href="2.html"><b>2</b></a>, <a href="3.html">3</a>]
with recursive=True
it prints:
[<span>1</span>, <a href="2.html"><b>2</b></a>, <b>2</b>, <a href="3.html">3</a>]
Trivia: now that I have the solution I found another seemingly unrelated question and answer in StackOverflow where the solution was hidden in a comment:
Why does BeautifulSoup .children contain nameless elements as well as the expected tag(s)
Upvotes: 4