Reputation: 116
I'm a Python 3 programmer, new to BeautifulSoup and HTMLParser. I'm using BeautifulSoup to extract all the definition-list data from an HTML file, and I'm trying to store the dt and dd contents in a Python dictionary as key-value pairs. My HTML file (List_page.html) is:
<!DOCTYPE html>
<html lang="en">
<head>STH here</head>
<body>
<!--some irrelevant things here-->
<dl class="key_value">
<dt>Sine</dt>
<dd>The ratio of the length of the opposite side to the length of the hypotenuse.</dd>
<dt>Cosine</dt>
<dd>The ratio of the length of the adjacent side to the length of the hypotenuse.</dd>
</dl>
<!--some irrelevant things here-->
</body>
</html>
and my Python code is:
from bs4 import BeautifulSoup
from html.parser import HTMLParser
dt = []
dd = []
dl = {}
class DTParser(HTMLParser):
    def handle_data(self, data):
        dt.append(data)
class DDParser(HTMLParser):
    def handle_data(self, data):
        dd.append(data)
html_page = open("List_page.html")
soup = BeautifulSoup(html_page, features="lxml")
dts = soup.select("dt")
parser = DTParser()
# Start of part 1:
parser.feed(str(dts[0]).replace('\n', ''))
parser.feed(str(dts[1]).replace('\n', ''))
# end of part 1
dds = soup.select("dd")
parser = DDParser()
# Start of part 2
parser.feed(str(dds[0]).replace('\n', ''))
parser.feed(str(dds[1]).replace('\n', ''))
# End of part 2
dl = dict(zip(dt, dd))
print(dl)
This outputs the dictionary correctly, as expected. However, when I replace part 1 (or part 2) with a for loop, it starts to go wrong. For example, this code:
# Similar change for part 2
for dt in dts:
    parser.feed(str(dts[0]).replace('\n', ''))
In this case the result only contains the definition of Cosine, not Sine. With two items I can do this without a loop, but what if I have more? I'd like to know the correct way to do this. Thanks.
Upvotes: 0
Views: 403
Reputation: 6112
Inside the for loop you index dts[0] on every iteration, so you always feed the first element of dts instead of advancing through the list. Change it to:
for i in range(len(dts)):
    parser.feed(str(dts[i]).replace('\n', ''))
and
for i in range(len(dds)):
    parser.feed(str(dds[i]).replace('\n', ''))
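One more detail that may matter here: the loop in the question is written as "for dt in dts", which rebinds the module-level name dt that handle_data appends to, so the list is replaced by a tag object. Looping over an index (or using a different loop variable name) avoids that. As an alternative, here is a minimal sketch that skips the global lists and the two parser classes and collects the pairs in a single pass with one HTMLParser subclass from the standard library. The class name DefinitionListParser and its attributes are hypothetical, chosen only for illustration, and the sketch assumes each dt is followed by its dd, as in the sample file.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Hypothetical helper: collect <dt>/<dd> text pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.definitions = {}      # term -> definition
        self._current_tag = None   # "dt", "dd", or None when outside both
        self._term = None          # text of the most recent <dt>
        self._buffer = []          # text chunks of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a <dt> or <dd>.
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "dt":
            self._term = text
        elif self._term is not None:
            self.definitions[self._term] = text
        self._current_tag = None


# Same definition list as in List_page.html, inlined to keep the sketch runnable.
html = """
<dl class="key_value">
<dt>Sine</dt>
<dd>The ratio of the length of the opposite side to the length of the hypotenuse.</dd>
<dt>Cosine</dt>
<dd>The ratio of the length of the adjacent side to the length of the hypotenuse.</dd>
</dl>
"""

parser = DefinitionListParser()
parser.feed(html)
parser.close()
print(parser.definitions)
```

Because the state lives on the parser instance, the same class works for any number of dt/dd pairs without indexing or module-level variables.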
Upvotes: 2