Reputation: 7929
I have the following HTML code:
html_doc = """
<h2> API guidance for developers</h2>
<h2>Images</h2>
<h2>Score descriptors</h2>
<h2>Downloadable XML data files (updated daily)</h2>
<h2>
East Counties</h2>
<h2>
East Midlands</h2>
<h2>
London</h2>
<h2>
North East</h2>
<h2>
North West</h2>
<h2>
South East</h2>
<h2>
South West</h2>
<h2>
West Midlands</h2>
<h2>
Yorkshire and Humberside</h2>
<h2>
Northern Ireland</h2>
<h2>
Scotland</h2>
<h2>
Wales</h2>
"""
How can I skip the first four h2 elements and access the text strings such as East Counties and so forth?
My attempt does not skip the first four elements, and it returns the strings with all the whitespace embedded in the code (which I want to get rid of):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
for h2 in soup.find_all('h2'):
    next
    next
    next
    next
    print (str(h2.children.next()))
The desired result:
East Counties
East Midlands
London
North East
...
What am I doing wrong?
Upvotes: 1
Views: 467
Reputation: 5950
You can use slicing here: find_all returns a ResultSet, which behaves like a list, so you can slice it with [4:] to skip the first four elements. To get rid of the surrounding whitespace, use strip():
for h2 in soup.find_all('h2')[4:]:
    print(h2.text.strip())
East Counties
East Midlands
London
North East
North West
...
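As for what went wrong in the question: a bare next on its own line only names the built-in function without calling it, so nothing is skipped. The skipping has to be applied to the sequence being looped over. Here is a minimal sketch of that idea, using a made-up list of plain strings as a stand-in for the text of each h2 tag (the headings list is an assumption for illustration, not the parsed document):

```python
from itertools import islice

# Hypothetical stand-ins for the raw text of each <h2> tag, whitespace included.
headings = [
    " API guidance for developers",
    "Images",
    "Score descriptors",
    "Downloadable XML data files (updated daily)",
    "\nEast Counties",
    "\nEast Midlands",
    "\nLondon",
]

# Slicing skips the first four items; strip() drops leading/trailing whitespace.
sliced = [h.strip() for h in headings[4:]]

# islice gives the same result for any iterable, including ones without slicing.
lazy = [h.strip() for h in islice(headings, 4, None)]

print(sliced)
```

Slicing is the simpler choice when the result is already a list; islice is only worth reaching for when the source is a generator.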
Upvotes: 4
Reputation: 5933
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
for h2 in soup.find_all('h2')[4:]:  # slicing to skip the first 4 elements
    print(h2.text.strip())  # get the inner text of the tag, then strip the whitespace
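If you would rather collect the names than print them, something like the following should work. It is only a sketch: it uses a shortened copy of the html_doc from the question, and get_text(strip=True) in place of .text.strip(), which gives the same result here because each h2 holds a single text node.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's html_doc, for illustration only.
html_doc = """
<h2> API guidance for developers</h2>
<h2>Images</h2>
<h2>Score descriptors</h2>
<h2>Downloadable XML data files (updated daily)</h2>
<h2>
East Counties</h2>
<h2>
East Midlands</h2>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Skip the first four <h2> tags, then take the whitespace-stripped text of the rest.
regions = [h2.get_text(strip=True) for h2 in soup.find_all("h2")[4:]]

print(regions)
```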
Upvotes: 2