Oscar Rodriguez

Reputation: 3

Extracting tables from web

I need to extract the second column of every table on this page: https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表

I don't need the last three tables, though.

However, my code only extracts the second column from the first table.

 import bs4 as bs
 import pickle
 import requests

 def save_china_tickers():
     resp = requests.get('https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表')
     soup = bs.BeautifulSoup(resp.text, 'lxml')
     # find() returns only the first matching table
     table = soup.find('table', {'class': 'wikitable'})
     tickers = []
     for row in table.findAll('tr')[1:]:
         ticker = row.findAll('td')[1].text
         tickers.append(ticker)
     with open('chinatickers.pickle', 'wb') as f:
         pickle.dump(tickers, f)
     return tickers

 save_china_tickers()

Upvotes: 0

Views: 55

Answers (1)

tbhaxor

Reputation: 1943

I have an easy method.

  1. Get the HTTP response
  2. Find all tables using a regex
  3. Parse each HTML table into a list of lists
  4. Iterate over each list in the list

Requirements

  1. dashtable

Code

from urllib.request import urlopen
from dashtable import html2data  # converts an HTML table to a list of lists
import re

url = "https://zh.wikipedia.org/wiki/%E4%B8%8A%E6%B5%B7%E8%AF%81%E5%88%B8%E4%BA%A4%E6%98%93%E6%89%80%E4%B8%8A%E5%B8%82%E5%85%AC%E5%8F%B8%E5%88%97%E8%A1%A8"

# Read the HTTP response body (decode() defaults to UTF-8)
data = urlopen(url).read().decode()

# Find every table with a regex and re-wrap its inner HTML in a bare <table> tag
tables = ["<table>{}</table>".format(table) for table in re.findall(r"<table .*?>(.*?)</table>", data, re.M|re.S|re.I)]

# Parse each table; html2data returns a tuple whose 0th element is the list of lists
parsed_tables = [html2data(table)[0] for table in tables]



# Take the first table, i.e. 600000-600099
parsed = parsed_tables[0]

# Column names of the first table
print(parsed[0])

# Second column of every data row in the first table
for index in range(1, len(parsed)):
    print(parsed[index][1])


"""
Output: All the rows of table 1, column 2 excluding the headers
"""
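
To cover every table except the last three, as the question asks, the same loop can be wrapped in a small helper. This is only a sketch: `second_column` and its `skip_last` parameter are names I made up for illustration, and the sample data below is a stand-in for `parsed_tables`, not real content from the page.

```python
def second_column(parsed_tables, skip_last=3):
    """Collect the 2nd-column value of every data row, ignoring the last `skip_last` tables."""
    # parsed_tables[:-0] would be empty, so handle skip_last == 0 explicitly
    wanted = parsed_tables[:-skip_last] if skip_last else parsed_tables
    values = []
    for table in wanted:
        for row in table[1:]:       # table[0] is assumed to be the header row
            if len(row) > 1:        # skip short or malformed rows
                values.append(str(row[1]).strip())
    return values


# Hypothetical stand-in for parsed_tables: three tiny tables, the last one unwanted
sample = [
    [["Code", "Name"], ["600000", "A"], ["600004", "B"]],
    [["Code", "Name"], ["600100", "C"]],
    [["x", "y"], ["1", "skip"]],
]
print(second_column(sample, skip_last=1))  # ['A', 'B', 'C']
```

With the real data you would call `second_column(parsed_tables)` and, if needed, pickle the returned list the same way the question does.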

Upvotes: 0
