Sam Y

Reputation: 116

Loop through BeautifulSoup list and parse each to HTML tags and data problem

I am a Python 3 programmer, new to BeautifulSoup and HTMLParser. I'm using BeautifulSoup to fetch all the definition list data from an HTML file, and I am trying to store the dt and dd data in a Python dictionary as corresponding key-value pairs. My HTML file (List_page.html) is:

<!DOCTYPE html>
<html lang="en">
<head>STH here</head>
<body>
    <!--some irrelevant things here-->
    <dl class="key_value">
        <dt>Sine</dt>
        <dd>The ratio of the length of the opposite side to the length of the hypotenuse.</dd>
        <dt>Cosine</dt>
        <dd>The ratio of the length of the adjacent side to the length of the hypotenuse.</dd>
    </dl>
    <!--some irrelevant things here-->
</body>
</html>

and my Python code is:

from bs4 import BeautifulSoup
from html.parser import HTMLParser

dt = []
dd = []
dl = {}

class DTParser(HTMLParser):
    def handle_data(self, data):
        dt.append(data)

class DDParser(HTMLParser):
    def handle_data(self, data):
        dd.append(data)

html_page = open("List_page.html")
soup = BeautifulSoup(html_page, features="lxml")

dts = soup.select("dt")
parser = DTParser()

# Start of part 1:
parser.feed(str(dts[0]).replace('\n', ''))
parser.feed(str(dts[1]).replace('\n', ''))
# end of part 1

dds = soup.select("dd")
parser = DDParser()

# Start of part 2
parser.feed(str(dds[0]).replace('\n', ''))
parser.feed(str(dds[1]).replace('\n', ''))
# End of part 2

dl = dict(zip(dt, dd))
print(dl)

The output was originally posted as a screenshot, which is not reproduced here. It is correct and as expected: each dt term is mapped to its dd definition. However, when I replace part 1 (or part 2) with a for loop, it starts to go wrong.

For example, this code:

# Similar change for part 2
for dt in dts:
    parser.feed(str(dts[0]).replace('\n', ''))

In this case it only tells me the definition of Cosine, not Sine. With two items I can do this without a loop, but what if I have more items? I would like to know the correct way to do this. Thanks.

Upvotes: 0

Views: 403

Answers (1)

alec

Reputation: 6112

In the for loop you are using dts[0] on every iteration, so you always get the first element of dts instead of advancing through the list. Index with the loop variable instead:

for i in range(len(dts)):
    parser.feed(str(dts[i]).replace('\n', ''))

and

for i in range(len(dds)):
    parser.feed(str(dds[i]).replace('\n', ''))
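
A related point worth noting: in the question's loop, `for dt in dts` reuses the name `dt`, which rebinds the module-level `dt` list that `DTParser.handle_data` appends to. Indexing with `i` avoids that by accident. Below is a minimal sketch of the same idea that iterates over the items directly and keeps the collected text on the parser instance instead of in globals. The names `TextCollector` and `collect_texts` are hypothetical, and the hard-coded fragments are an assumption standing in for `str(tag)` of each element returned by `soup.select(...)`.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: stores every text node it sees on the instance."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        self.texts.append(data)


def collect_texts(fragments):
    """Feed each HTML fragment to one parser and return the collected text."""
    parser = TextCollector()
    # Iterate over the items themselves; the loop variable is named so that
    # it cannot shadow any list the parser writes to.
    for fragment in fragments:
        parser.feed(fragment.replace("\n", ""))
    parser.close()
    return parser.texts


# Assumption: stand-ins for [str(tag) for tag in soup.select("dt")] and the
# equivalent for "dd", using the content of List_page.html from the question.
dt_fragments = ["<dt>Sine</dt>", "<dt>Cosine</dt>"]
dd_fragments = [
    "<dd>The ratio of the length of the opposite side to the length of the hypotenuse.</dd>",
    "<dd>The ratio of the length of the adjacent side to the length of the hypotenuse.</dd>",
]

terms = collect_texts(dt_fragments)
definitions = collect_texts(dd_fragments)
dl = dict(zip(terms, definitions))
print(dl)
```

In the original script the two fragment lists would come from `soup.select("dt")` and `soup.select("dd")`. If only the text is needed, BeautifulSoup's `Tag.get_text()` may make the separate HTMLParser step unnecessary.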

Upvotes: 2
