goxywood

Reputation: 121

How to extract tables from different pages? (python)

I want to extract the tables from the first several pages of http://

The tables have been scraped by the code below and they are in a list:

import urllib.request
from bs4 import BeautifulSoup

base_url = "http://"
url_list = ["{}?page={}".format(base_url, str(page)) for page in range(1, 21)]

mega = []
for url in url_list:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'table table-bordered table-striped table-hover'}) 
    mega.append(table)

Because mega is a list, I cannot call find_all on it directly, so I loop over its elements (each one is a bs4.element.Tag) to search further for the items I want:

for i in mega:
    trs = table.find_all('tr')[1:]
    rows = list()
    for tr in trs:
        rows.append([td.text.replace('\n', '').replace('\xa0', '').replace('\t', '').strip().rstrip() for td in tr.find_all('td')])
rows

rows only contains the table from the last page. What is wrong with my code, so that the previous 19 tables are not extracted? Thanks!

The lengths of the two objects are not equal. I used for i in mega to obtain i.

len(mega) = 20
len(i) = 5

Upvotes: 1

Views: 134

Answers (1)

mHvNG

Reputation: 671

The problem is pretty simple. In this for loop:

for i in mega:
    trs = table.find_all('tr')[1:]
    rows = list()
    for tr in trs:
        rows.append([td.text.replace('\n', '').replace('\xa0', '').replace('\t', '').strip().rstrip() for td in tr.find_all('td')])

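There are two problems. First, rows = list() is initialized inside the for loop, so it is reset on each of the 20 iterations and only the rows from the final iteration survive. Second, the loop body calls table.find_all instead of i.find_all; table still refers to the last table found by the scraping loop, so every iteration would read the last page anyway.

So you need to have it like this:

rows = list()
for table in mega:
    trs = table.find_all('tr')[1:]
    for tr in trs:
        rows.append([td.text.replace('\n', '').replace('\xa0', '').replace('\t', '').strip() for td in tr.find_all('td')])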
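To try the accumulation pattern without hitting the network, here is a minimal sketch that parses two small inline HTML strings. The sample pages, the helper names clean and extract_rows, and the shortened class name are all made up for illustration; the real site's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the HTML of two result pages.
pages = [
    "<table class='table table-bordered'><tr><th>Name</th><th>Score</th></tr>"
    "<tr><td>\n alice\xa0</td><td>\t1 </td></tr></table>",
    "<table class='table table-bordered'><tr><th>Name</th><th>Score</th></tr>"
    "<tr><td>bob</td><td>2</td></tr><tr><td>carol</td><td>3</td></tr></table>",
]


def clean(text):
    # Drop newlines, tabs and non-breaking spaces, then trim the ends.
    for ch in ("\n", "\t", "\xa0"):
        text = text.replace(ch, "")
    return text.strip()


def extract_rows(html_pages):
    rows = []  # created once, before the loop over pages
    for html in html_pages:
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find("table", class_="table")
        if table is None:  # page without a matching table
            continue
        # Use the table of the current page, and skip its header row.
        for tr in table.find_all("tr")[1:]:
            rows.append([clean(td.text) for td in tr.find_all("td")])
    return rows


rows = extract_rows(pages)
print(rows)  # [['alice', '1'], ['bob', '2'], ['carol', '3']]
```

Because rows is created before the loop and each iteration reads its own table, the result holds the rows of every page rather than only the last one. The None check is an extra precaution in case find does not match anything on some page.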

Upvotes: 1
