T.singh

Reputation: 71

How to get specific table from HTML

We have Form 10-K filings for several companies and want to extract the Earnings table (Item 6) from the HTML. The structure of the filing varies from company to company.

For example:

url1= 'https://www.sec.gov/Archives/edgar/data/794367/000079436719000038/m-0202201910xk.htm' 
url2='https://www.sec.gov/Archives/edgar/data/885639/000156459019009005/kss-10k_20190202.htm'

We need to get the table in Item 6, Selected Consolidated Financial Data.

One approach we tried is a string search: find "Item 6", take all the text from there up to "Item 7", then extract the tables from that slice:

import requests
import bs4 as bs

doc10K = requests.get(url2)

st6 = doc10K.text.lower().find("item 6")
end6 = doc10K.text.lower().find("item 7")

# get the text from Item 6 to Item 7 and remove the currency sign
item6 = doc10K.text[st6:end6].replace('$', '')

Tsoup = bs.BeautifulSoup(item6, 'lxml')

# extract all tables from the section
html_tables = Tsoup.find_all('table')

This approach doesn't work for all the filings. For example, with KSS the string "item 6" is not found at all. The ideal output is the table given in Item 6.

Upvotes: 0

Views: 269

Answers (3)

QHarr

Reputation: 84465

With bs4 4.7.1+ you can use the :contains and :has pseudo-classes to target the table based on the surrounding HTML. CSS OR syntax (a comma-separated selector list) means either of the two patterns shown below can match.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

urls = ['https://www.sec.gov/Archives/edgar/data/794367/000079436719000038/m-0202201910xk.htm','https://www.sec.gov/Archives/edgar/data/885639/000156459019009005/kss-10k_20190202.htm']

with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = pd.read_html(str(soup.select_one('table:contains("Item 6") ~ div:has(table) table, p:contains("Selected Consolidated Financial Data") ~ div:has(table) table')))[0]
        table.dropna(axis = 0, how = 'all',inplace= True)
        table.dropna(axis = 1, how = 'all',inplace= True)
        table.fillna(' ', inplace=True)
        table.rename(columns= table.iloc[0], inplace = True) #set headers same as row 1
        table.drop(table.index[0:2], inplace = True)  #drop the first two rows, now duplicated in the header
        table.reset_index(drop=True, inplace = True) #re-index
        print(table)
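To see what the two pseudo-classes do in isolation, here is a minimal sketch on toy HTML (the markup below is made up and only stands in for a filing's structure): :contains matches an element whose text includes the given string, ~ steps to its following siblings, and :has filters those siblings by their descendants.

```python
from bs4 import BeautifulSoup

# Toy markup standing in for a filing: the target table sits in a
# div that follows the section-heading paragraph.
html = """
<p>Selected Consolidated Financial Data</p>
<div><table><tr><td>Net sales</td><td>100</td></tr></table></div>
<div><table><tr><td>Other</td><td>200</td></tr></table></div>
"""

soup = BeautifulSoup(html, 'html.parser')
# select_one returns the first match in document order, i.e. the
# first table inside a div following the matching <p>.
table = soup.select_one(
    'p:contains("Selected Consolidated Financial Data") ~ div:has(table) table')
print(table.td.text)
```

The same idea scales up to the full selector in the answer, which simply ORs two such patterns together with a comma.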

Upvotes: 0

dabingsou

Reputation: 2469

petezurich is right, but the string marker alone does not position the search reliably in every filing.

# You can try this, too: search for either heading and start from
# whichever one is found
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

doc10K = requests.get(url2)
doc = SimplifiedDoc(doc10K.text)

start = doc.html.rfind('Selected Consolidated Financial Data')
if start < 0:
  start = doc.html.rfind('Selected Financial Data')

tables = doc.getElementsByTag('table',start=start,end=['Item 7','Item&#160;7'])
for table in tables:
  trs = table.trs
  for tr in trs:
    tds = tr.tds
    for td in tds:
      print(td.text)
      # print(td.unescape()) #Replace HTML entity

Upvotes: 1

petezurich

Reputation: 10184

The string "item 6" seems to contain either a regular space or a non-breaking space.

Try this cleaned code:

import requests
from bs4 import BeautifulSoup

url1= 'https://www.sec.gov/Archives/edgar/data/794367/000079436719000038/m-0202201910xk.htm' 
url2='https://www.sec.gov/Archives/edgar/data/885639/000156459019009005/kss-10k_20190202.htm'

doc10K = requests.get(url2)

st6 = doc10K.text.lower().find("item 6")

# found "item 6"? if not, search again with an underscore
if st6 == -1:
    st6 = doc10K.text.lower().find("item_6") 

end6 = doc10K.text.lower().find("item 7")
item6 = doc10K.text[st6:end6].replace('$','')
soup = BeautifulSoup(item6, 'lxml')
html_tables = soup.find_all('table')
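Another option, building on the non-breaking-space diagnosis above, is to normalize the raw HTML before searching, so a plain "item 6" search works regardless of how the space is encoded. This is only a sketch; the toy string below stands in for doc10K.text:

```python
# Toy stand-in for the raw filing HTML, where the space in "Item 6"
# may be the \xa0 character or an HTML entity.
raw = 'Item&#160;6.&#160;Selected Financial Data <table>...</table> Item&#160;7. MD&amp;A'

# Replace the common non-breaking-space encodings with plain spaces,
# then search the normalized text.
normalized = (raw.replace('&#160;', ' ')
                 .replace('&nbsp;', ' ')
                 .replace('\xa0', ' ')
                 .lower())
st6 = normalized.find('item 6')
end6 = normalized.find('item 7')
item6 = normalized[st6:end6]
print(item6)
```

Note that replacing a multi-character entity like &#160; changes string offsets, so the slice is taken from the normalized text rather than the original.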

Upvotes: 0
