user47467

Reputation: 1093

Beautifulsoup parsing html line breaks

I'm using BeautifulSoup to parse some HTML from a text file. The text is written to a dictionary like so:

import codecs

websites = ["1"]

html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        get_raw_html = out.read().splitlines()
        html_dict.update({website_id:get_raw_html})

I then parse the HTML stored in html_dict to find the text inside <p> tags:

from bs4 import BeautifulSoup

scraped = {}

for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all('p')

This is what the HTML in html_dict looks like:

<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>

The problem is that BeautifulSoup seems to stop at the line break and ignore the second line. So when I print scrape_selected_tags, the output is...

<p>Hey, this should be scraped</p>

when I would expect the whole text.

How can I avoid this? I've tried splitting the lines in html_dict, but that doesn't seem to work. Thanks in advance.

Upvotes: 1

Views: 3205

Answers (1)

t.m.adam

Reputation: 15376

By calling splitlines when you read your HTML documents, you break each document into a list of strings, so a tag that spans several lines is cut into pieces.
Instead, you should read all of the HTML into a single string.

import codecs

websites = ["1"]
html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        # Read the whole file into one string; do not call splitlines()
        get_raw_html = out.read()
        html_dict.update({website_id:get_raw_html})
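
To see what splitlines does to the markup before BeautifulSoup is even involved, here is a minimal sketch. It uses io.StringIO as a stand-in for raw_html.txt, which is an assumption made only so the example runs without a file on disk:

```python
import io

# Stand-in for the contents of raw_html.txt (assumption for illustration)
raw = "<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>"

# read() returns the document as one string, newline included
whole = io.StringIO(raw).read()

# read().splitlines() returns a list with one entry per line,
# so the opening and closing <p> tags end up in different strings
pieces = io.StringIO(raw).read().splitlines()

print(repr(whole))
print(pieces)
```

The first entry of pieces has an opening <p> but no closing tag, and the second has the closing tag but no opening one.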

Then remove the inner for loop, so that each document is parsed once as a whole instead of piece by piece.

from bs4 import BeautifulSoup

scraped = {}

for website_id in html_dict.keys():
    raw_html = html_dict[website_id]
    # Parse the whole document once, not line by line
    soup = BeautifulSoup(raw_html, 'html.parser')
    scrape_selected_tags = soup.find_all('p')
    scraped[website_id] = scrape_selected_tags
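
If you want the paragraph text rather than the tag objects, the same loop can be wrapped in a small function. This is only a sketch: collect_paragraphs and the sample dictionary are hypothetical names used for illustration, and the HTML is passed in as strings instead of being read from files:

```python
from bs4 import BeautifulSoup


def collect_paragraphs(html_by_id):
    """Map each website id to the text of every <p> in its HTML string."""
    scraped = {}
    for website_id, raw_html in html_by_id.items():
        # One soup per document, built from the full string
        soup = BeautifulSoup(raw_html, "html.parser")
        scraped[website_id] = [p.get_text() for p in soup.find_all("p")]
    return scraped


# Hypothetical sample input, mirroring the HTML from the question
sample = {
    "1": "<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>"
}
result = collect_paragraphs(sample)
print(result)
```

The newline stays inside the extracted text, so both lines of the paragraph are present in a single list entry.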

BeautifulSoup can handle newlines inside tags. Here is an example:

html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('p'))

[<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>]

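But if you split one tag across multiple BeautifulSoup objects, each line is parsed on its own, and the second line, which has no opening <p>, matches nothing: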

html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''

for line in html.splitlines():
    soup = BeautifulSoup(line, 'html.parser')
    print(soup.find_all('p'))

[<p>Hey, this should be scraped</p>]
[]
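
If the lines have already been split somewhere you can't easily change, one option is to join them back into a single string before parsing. A minimal sketch, assuming the list holds the lines of one document in their original order (the variable names here are illustrative):

```python
from bs4 import BeautifulSoup

# Lines as produced by splitlines() on the HTML from the question
lines = [
    "<p>Hey, this should be scraped",
    "but this part gets ignored for some reason.</p>",
]

# Rebuild one document string, then parse it once
rejoined = "\n".join(lines)
soup = BeautifulSoup(rejoined, "html.parser")
paragraphs = soup.find_all("p")

print(len(paragraphs))
print(paragraphs[0].get_text())
```

Reading the file with read() in the first place is simpler, but joining gives the same single <p> when only the split list is available.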

Upvotes: 1
