FaCoffee

Reputation: 7929

Python: skip lines while parsing html code and get rid of white spaces

I have the following html code:

html_doc = """
<h2> API guidance for developers</h2>
<h2>Images</h2>
<h2>Score descriptors</h2>
<h2>Downloadable XML data files (updated daily)</h2>
<h2>
                                    East Counties</h2>
<h2>
                                    East Midlands</h2>
<h2>
                                    London</h2>
<h2>
                                    North East</h2>
<h2>
                                    North West</h2>
<h2>
                                    South East</h2>
<h2>
                                    South West</h2>
<h2>
                                    West Midlands</h2>
<h2>
                                    Yorkshire and Humberside</h2>
<h2>
                                    Northern Ireland</h2>
<h2>
                                    Scotland</h2>
<h2>
                                    Wales</h2>
"""

How can I skip the first four lines and access the text strings such as East Counties and so forth?

My attempt does not skip the first four lines, and it returns the strings together with all the whitespace embedded in the markup (which I want to get rid of):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
for h2 in soup.find_all('h2'):
    next
    next
    next
    next
    print (str(h2.children.next()))

The desired result:

East Counties
East Midlands
London
North East
...

What am I doing wrong?

Upvotes: 1

Views: 467

Answers (2)

akash karothiya

Reputation: 5950

You can use slicing here: find_all returns a ResultSet, which is a list subclass, so you can index and slice it, for example with [4:] to skip the first four elements. To remove the surrounding whitespace, call strip() on each tag's text.

for h2 in soup.find_all('h2')[4:]:
    print(h2.text.strip())

East Counties
East Midlands
London
North East
North West
...    
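
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same slicing idea. The html_doc below is a shortened stand-in for the one in the question, and region_names and its skip parameter are names made up for this illustration. It uses get_text(strip=True), which should give the same result as h2.text.strip() for tags that contain a single text node like these.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the html_doc in the question (assumed to have the same shape).
html_doc = """
<h2> API guidance for developers</h2>
<h2>Images</h2>
<h2>Score descriptors</h2>
<h2>Downloadable XML data files (updated daily)</h2>
<h2>
                                    East Counties</h2>
<h2>
                                    East Midlands</h2>
"""


def region_names(html, skip=4):
    """Return the stripped text of every <h2> after the first `skip` ones.

    `region_names` and `skip` are hypothetical names used only in this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list-like ResultSet, so [skip:] drops the leading headings.
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")[skip:]]


regions = region_names(html_doc)
for name in regions:
    print(name)
```

Hard-coding 4 only works as long as the page keeps exactly four headings before the regions, so it may be worth passing the count in explicitly as above.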

Upvotes: 4

James Kent

Reputation: 5933

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

for h2 in soup.find_all('h2')[4:]: # slicing to skip the first 4 elements
    print(h2.text.strip()) # get the inner text of the tag and then strip the white space
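
As for what goes wrong in the original attempt: a bare next on its own line only names the built-in function and discards it, so it never advances the loop. Also, in Python 3 iterators have no .next() method; the built-in next(iterator) is used instead. If you would rather skip inside the loop than slice, a rough sketch (assuming Python 3 and a shortened stand-in for html_doc) could look like this:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the html_doc in the question (assumed to have the same shape).
html_doc = """
<h2> API guidance for developers</h2>
<h2>Images</h2>
<h2>Score descriptors</h2>
<h2>Downloadable XML data files (updated daily)</h2>
<h2>
                                    London</h2>
<h2>
                                    Wales</h2>
"""

soup = BeautifulSoup(html_doc, "html.parser")

names = []
for index, h2 in enumerate(soup.find_all("h2")):
    if index < 4:
        continue  # this is what actually skips an iteration; a bare `next` does nothing
    first_child = next(h2.children)  # Python 3 spelling of h2.children.next()
    names.append(str(first_child).strip())  # strip the newline and indentation

for name in names:
    print(name)
```

The slicing version above is shorter; this variant is only meant to show how the original loop would need to change.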

Upvotes: 2
