Reputation: 121
I want to extract the tables from the first several pages of http://
The tables have been scraped by the code below and they are stored in a list:
import urllib.request
from bs4 import BeautifulSoup
base_url = "http://"
url_list = ["{}?page={}".format(base_url, str(page)) for page in range(1, 21)]
mega = []
for url in url_list:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'table table-bordered table-striped table-hover'})
    mega.append(table)
Because mega is a list, I cannot call find_all on it directly to extract the items I want, so I iterate over its bs4.element.Tag elements to search them further:
for i in mega:
    trs = table.find_all('tr')[1:]
    rows = list()
    for tr in trs:
        rows.append([td.text.replace('\n', '').replace('\xa0', '').replace('\t', '').strip().rstrip() for td in tr.find_all('td')])
rows
But rows only contains the rows of the last page's table. What is the problem with my code, such that the previous 19 tables are not extracted? Thanks!
The lengths of the two items are not equal. I used for i in mega to obtain i.
len(mega) = 20
len(i) = 5
Upvotes: 1
Views: 134
Reputation: 671
The problem is pretty simple. In this for loop:
for i in mega:
    trs = table.find_all('tr')[1:]
    rows = list()
    for tr in trs:
        rows.append([td.text.replace('\n', '').replace('\xa0', '').replace('\t', '').strip().rstrip() for td in tr.find_all('td')])
you initialize rows = list()
inside the for loop. So you loop 20 times, but you also re-create the empty list 20 times, which throws away everything collected in the previous iterations.
So you need to move the initialization outside the loop, and read the rows from i rather than from table (which still points at the last table scraped):
rows = list()
for i in mega:
    trs = i.find_all('tr')[1:]
    for tr in trs:
        rows.append([td.text.replace('\n', '').replace('\xa0', '').replace('\t', '').strip().rstrip() for td in tr.find_all('td')])
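For reference, here is a minimal single-pass sketch, assuming the truncated base_url and the table class from the question; it fetches and parses each page in the same loop, so there is no leftover table variable to shadow the per-page result:

import urllib.request
from bs4 import BeautifulSoup

base_url = "http://"  # truncated in the question, kept as-is
rows = []
for page in range(1, 21):
    html = urllib.request.urlopen("{}?page={}".format(base_url, page)).read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'table table-bordered table-striped table-hover'})
    if table is None:  # skip pages without the expected table
        continue
    for tr in table.find_all('tr')[1:]:  # skip the header row
        rows.append([td.get_text(strip=True).replace('\xa0', '') for td in tr.find_all('td')])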
Upvotes: 1