Ben

Reputation: 4984

Extra newline character for children of Beautiful Soup

I use BeautifulSoup on a snippet of HTML as follows:

from bs4 import BeautifulSoup
s = """<div class="views-row views-row-1 views-row-odd views-row-  first">
            <span class="views-field views-field-title"> 
                <span class="field-content"><a href="/party-pictures/2015/love-heals">Love Heals</a>
                </span> 
            </span>
            <span class="views-field views-field-created"> 
                <span class="field-content">Friday, March 20, 2015
                </span> 
           </span> 
</div>""" 

soup = BeautifulSoup(s)

Why does soup.span only return the first span tag?

Moreover, soup.contents returns a list of length 4. Both span tags are in the list, but the elements at indices 0 and 2 are "\n" newline characters. The newline characters seem useless. Is there a reason why this is done?

Upvotes: 2

Views: 2125

Answers (1)

alecxe

Reputation: 473873

Why does soup.span only return the first span tag?

soup.span is a shortcut for soup.find('span'), which returns only the first occurrence of the span tag.
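As a small illustration of the shortcut, here is a sketch using a made-up two-span snippet (not the HTML from the question) and the built-in html.parser. It checks that attribute access and find() give the same tag, while find_all() returns every match. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal snippet, chosen only to show the behaviour
html = "<div><span>first</span><span>second</span></div>"
soup = BeautifulSoup(html, "html.parser")

# Attribute access and find() both return the first matching tag
first_by_attr = soup.span
first_by_find = soup.find("span")
same_tag = first_by_attr is first_by_find

# find_all() returns every matching tag, in document order
span_texts = [span.get_text() for span in soup.find_all("span")]

print(same_tag)
print(span_texts)
```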

Moreover, soup.contents returns a list of length 4. Both span tags are in the list, but the elements at indices 0 and 2 are "\n" newline characters. The newline characters seem useless. Is there a reason why this is done?

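By definition, .contents returns a list of all of an element's children, including text nodes, which are instances of the NavigableString class. The whitespace between tags in the source is itself a text node, so it shows up in the list.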
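To make the text-node point concrete, the following sketch parses a made-up snippet (an assumption, simpler than the HTML in the question) with the built-in html.parser, lists the type of each child in .contents, and keeps only the Tag instances.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet with a newline between each child tag
html = """<div>
<span>one</span>
<span>two</span>
</div>"""
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents includes the whitespace-only text nodes between the tags
children = div.contents
kinds = [type(child).__name__ for child in children]

# Keep only real tags by filtering on the Tag class
tags_only = [child for child in children if isinstance(child, Tag)]
tag_texts = [tag.get_text() for tag in tags_only]

print(kinds)
print(tag_texts)
```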

If you want the tags only, you can use find_all():

soup.find_all()

And, if only span tags:

soup.find_all('span')

Example:

>>> from bs4 import BeautifulSoup
>>> s = """<div class="views-row views-row-1 views-row-odd views-row-  first">
...             <span class="views-field views-field-title"> 
...                 <span class="field-content"><a href="/party-pictures/2015/love-heals">Love Heals</a>
...                 </span> 
...             </span>
...             <span class="views-field views-field-created"> 
...                 <span class="field-content">Friday, March 20, 2015
...                 </span> 
...            </span> 
... </div>""" 
>>> soup = BeautifulSoup(s, "html.parser")
>>> for span in soup.find_all('span'):
...     print(span.text.strip())
... 
Love Heals
Love Heals
Friday, March 20, 2015
Friday, March 20, 2015

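The duplicates appear because the span elements are nested: find_all('span') matches each outer span as well as the inner span it contains, and the .text of an outer span includes the text of its inner span. There are several ways to avoid this. For example, you can restrict the search to the direct children of the div with recursive=False: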

>>> for span in soup.find('div', class_='views-row-1').find_all('span', recursive=False):
...     print(span.text.strip())
... 
Love Heals
Friday, March 20, 2015

Or you can use a CSS selector with the child combinator (>):

>>> for span in soup.select('div.views-row-1 > span'):
...     print(span.text.strip())
... 
Love Heals
Friday, March 20, 2015
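To see why the > combinator matters, here is a sketch comparing it with the plain descendant selector. It uses a condensed version of the HTML above (shortened class lists are an assumption made for brevity) and the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Condensed, hypothetical version of the snippet from the question
html = """<div class="views-row views-row-1">
  <span class="views-field"><span class="field-content">Love Heals</span></span>
  <span class="views-field"><span class="field-content">Friday, March 20, 2015</span></span>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: matches outer and inner spans alike
descendants = soup.select("div.views-row-1 span")

# Child combinator: matches only spans that are direct children of the div
direct_children = soup.select("div.views-row-1 > span")

descendant_count = len(descendants)
child_texts = [span.get_text(strip=True) for span in direct_children]

print(descendant_count)
print(child_texts)
```

The descendant form reproduces the duplicate problem, while the child form yields one result per outer span.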

Upvotes: 3
