Extracting tables from web

Question

I need to extract all tables from this web page (only the second column): https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表

I don't need the last three tables, though.

However, my code only extracts the second column from the first table.

 import pickle

 import bs4 as bs
 import requests

 def save_china_tickers():
     resp = requests.get('https://zh.wikipedia.org/wiki/上海证券交易所上市公司列表')
     soup = bs.BeautifulSoup(resp.text, 'lxml')
     table = soup.find('table', {'class': 'wikitable'})
     tickers = []
     for row in table.findAll('tr')[1:]:
         ticker = row.findAll('td')[1].text
         tickers.append(ticker)
     with open('chinatickers.pickle', 'wb') as f:
         pickle.dump(tickers, f)
     return tickers

 save_china_tickers()

tbhaxor · Accepted Answer

I have an easy method.

1. Get the HTTP response
2. Find all tables using a regex
3. Parse each HTML table into a list of lists
4. Iterate over each list in the list

Requirements

dashtable

Code

from urllib.request import urlopen
from dashtable import html2data # to convert html table to list of list
import re

url = "https://zh.wikipedia.org/wiki/%E4%B8%8A%E6%B5%B7%E8%AF%81%E5%88%B8%E4%BA%A4%E6%98%93%E6%89%80%E4%B8%8A%E5%B8%82%E5%85%AC%E5%8F%B8%E5%88%97%E8%A1%A8"

# Reading http content
data = urlopen(url).read().decode()

# now fetching all tables with the help of regex
tables = ["<table>{}</table>".format(table) for table in re.findall(r"<table[^>]*>(.*?)</table>", data, re.M|re.S|re.I)]

# parsing data
parsed_tables = [html2data(table)[0] for table in tables]  # html2data returns a tuple with 0th index as list of lists



# let's take the first table, i.e. 600000-600099
parsed = parsed_tables[0]

# column names of first table
print(parsed[0])

# rows of first table 2nd column
for index in range(1, len(parsed)):
    print(parsed[index][1])


"""
Output: All the rows of table 1, column 2 excluding the headers
"""
