BeautifulSoup parsing HTML line breaks

Question

I'm using BeautifulSoup to parse some HTML from a text file. The text is written to a dictionary like so:

import codecs

websites = ["1"]

html_dict = {}

for website_id in websites:
    # Read each site's saved HTML and store it as a list of lines
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        get_raw_html = out.read().splitlines()
        html_dict.update({website_id: get_raw_html})

I then parse the HTML stored in html_dict to find the text inside <p> tags:

from bs4 import BeautifulSoup

scraped = {}

for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    # raw_html is a list of lines, so each line is parsed on its own
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all('p')

This is what the HTML in html_dict looks like:

<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>

The problem is that BeautifulSoup seems to stop at the line break and ignore the second line. When I print scrape_selected_tags, the output is:

Hey, this should be scraped

when I would expect the whole text.

How can I avoid this? I've tried splitting the lines in html_dict and it doesn't seem to work. Thanks in advance.
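
A likely cause, assuming the file contains a single <p> element whose text wraps onto a second line: splitlines() cuts that element in two, and each line is then parsed as a separate document. The first line still has an opening <p>, so it is found; the second line has only text and a stray closing tag, so find_all('p') returns nothing for it. The sketch below reproduces this with an in-memory string (the sample HTML is an assumption, not the real file) and compares it with parsing the whole string in one go:

```python
from bs4 import BeautifulSoup

# Assumed sample: one <p> element whose text spans two lines.
raw_html = "<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>"

# Line-by-line parsing, as in the question: every line is its own document.
per_line = [
    tag.get_text()
    for line in raw_html.splitlines()
    for tag in BeautifulSoup(line, "html.parser").find_all("p")
]

# Parsing the whole string at once keeps the <p> element intact.
soup = BeautifulSoup(raw_html, "html.parser")
whole = [tag.get_text() for tag in soup.find_all("p")]

print(per_line)  # only the first line's text
print(whole)     # full text, with the line break preserved inside it
```

In the original loop this would mean storing out.read() directly (without splitlines()) and calling BeautifulSoup once per website. If the embedded newline is unwanted, " ".join(tag.get_text().split()) collapses it to a single space.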
