Reputation: 1093
I'm using BeautifulSoup to parse some HTML from a text file. The text is written to a dictionary like so:
websites = ["1"]
html_dict = {}
for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        get_raw_html = out.read().splitlines()
    html_dict.update({website_id:get_raw_html})
I then parse the HTML in html_dict to find the text inside <p> tags:
scraped = {}
for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all('p')
This is what the HTML in html_dict looks like:
<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>
The problem is that BeautifulSoup seems to treat the line break as the end of the element and ignores the second line. So when I print scrape_selected_tags, the output is:
<p>Hey, this should be scraped</p>
when I would expect the whole text.
How can I avoid this? I've tried splitting the lines in html_dict, but that doesn't seem to work. Thanks in advance.
Upvotes: 1
Views: 3205
Reputation: 15376
By calling splitlines when you read your HTML documents, you break each document into a list of strings, so a tag that spans several lines ends up split across several list items.
Instead, you should read all the HTML into a single string.
import codecs

websites = ["1"]
html_dict = {}
for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        # Read the whole file as one string; do not call splitlines() here
        get_raw_html = out.read()
    html_dict.update({website_id:get_raw_html})
Then remove the inner for loop, so you won't iterate over that string.
from bs4 import BeautifulSoup

scraped = {}
for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    # Parse the complete document once instead of line by line
    soup = BeautifulSoup(raw_html, 'html.parser')
    scrape_selected_tags = soup.find_all('p')
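As a minimal sketch of the same idea, the snippet below collects the paragraph text into scraped. The helper name extract_paragraphs and the in-memory html_dict (standing in for the file contents returned by out.read()) are assumptions for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(raw_html):
    """Return the text of every <p> element found in one HTML string."""
    soup = BeautifulSoup(raw_html, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]


# Hypothetical stand-in for the whole-file string read with out.read()
html_dict = {
    "1": "<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>"
}

# One list of paragraph texts per website id
scraped = {
    website_id: extract_paragraphs(raw_html)
    for website_id, raw_html in html_dict.items()
}
print(scraped)
```

Because each document is parsed as a single string, the line break stays inside the paragraph text rather than cutting the tag in two.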
BeautifulSoup can handle newlines inside tags. For example:
html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('p'))
[<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>]
But if you split one tag across multiple BeautifulSoup objects, each line is parsed on its own and the second line no longer belongs to any <p> element:
html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''
for line in html.splitlines():
    soup = BeautifulSoup(line, 'html.parser')
    print(soup.find_all('p'))
[<p>Hey, this should be scraped</p>]
[]
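If the HTML has already been stored as a list of lines (for example, data produced earlier with splitlines), one possible workaround is to join the lines back into a single string before parsing. This is only a sketch; the variable name lines is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical list of lines, as produced by out.read().splitlines()
lines = [
    "<p>Hey, this should be scraped",
    "but this part gets ignored for some reason.</p>",
]

# Rebuild the full document so the <p> tag is parsed as one element
soup = BeautifulSoup("\n".join(lines), "html.parser")
paragraphs = soup.find_all("p")

print(len(paragraphs))
print(paragraphs[0].get_text())
```

Reading the file with out.read() in the first place is simpler, but joining gives the same single-string input when the split data is all you have.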
Upvotes: 1