Steve.Kim

Reputation: 71

Scraping a table with Python BeautifulSoup

I tried to scrape the table from this website: https://stockrow.com/VRTX/financials/income/quarterly

I am using Python in Google Colab and I'd like to have the dates as columns (e.g. 2020-06-30). I tried the following code:

import urllib.request
import bs4 as bs

source = urllib.request.urlopen('https://stockrow.com/VRTX/financials/income/quarterly').read()
soup = bs.BeautifulSoup(source, 'lxml')
table = soup.find_all('table')

However, I cannot get the tables. I am fairly new to scraping, so I looked at other Stack Overflow questions but couldn't solve the problem. Can you please help me? That would be much appreciated.

Upvotes: 0

Views: 423

Answers (2)

Andrej Kesely

Reputation: 195553

You can use their API to load the data:

import requests
import pandas as pd


indicators_url = 'https://stockrow.com/api/indicators.json'
data_url = 'https://stockrow.com/api/companies/VRTX/financials.json?ticker=VRTX&dimension=Q&section=Income+Statement'

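# map each indicator id to its metadata so ids can be turned into names
indicators = {i['id']: i for i in requests.get(indicators_url).json()}
all_data = []
for d in requests.get(data_url).json():
    # replace the opaque id with the human-readable indicator name
    d['id'] = indicators[d['id']]['name']
    all_data.append(d)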

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)

Prints:

                                     id    2020-06-30    2020-03-31    2019-12-31   2019-09-30   2019-06-30  ...   2011-12-31   2011-09-30    2011-06-30    2011-03-31    2010-12-31    2010-09-30
0          Consolidated Net Income/Loss   837270000.0   602753000.0   583234100.0   57518000.0  267427000.0  ...  188141000.0  228452000.0  -199318000.0  -176096000.0  -180392000.0  -208957000.0
1      EPS (Basic, from Continuous Ops)        3.2248        2.3199        2.2654       0.2239        1.044  ...       0.9374        1.109       -0.9751       -0.8703       -0.8966       -1.0402
2                     Net Profit Margin        0.5492        0.3978        0.4127       0.0606       0.2841  ...       0.2816       0.3354       -1.5213       -2.3906       -2.7531       -8.7816
3                          Gross Profit  1339965000.0  1352610000.0  1228253000.0  817914000.0  805553000.0  ...  533213000.0  620794000.0   105118000.0    70996000.0    62475000.0    20567000.0
4                  Income Tax Provision   -12500000.0    54781000.0    93716000.0   13148000.0   59711000.0  ...   22660000.0  -27842000.0    24448000.0           0.0           NaN           0.0
5                      Operating Income   718033000.0   720224100.0   551464400.0   99333000.0  269960000.0  ...  223901900.0  215707000.0  -165890000.0  -159899000.0  -166634000.0  -199588000.0
6                                  EBIT   718033000.0   720224100.0   551464700.0   99333000.0  269960000.0  ...  223901900.0  215707000.0  -165890000.0  -159899000.0  -166634000.0  -199588000.0
7         EPS (Diluted, from Cont. Ops)        3.1787        2.2874        2.2319       0.2208       1.0293  ...       1.0011       1.0415       -0.9751       -0.8703       -0.8966       -1.0402
8                                EBITDA   744730000.0   747045000.0   577720400.0  125180000.0  297658000.0  ...  233625900.0  223457000.0  -157181000.0  -151041000.0  -158429000.0  -192830000.0
9             EPS (Basic, Consolidated)        3.2248        2.3199        2.2654       0.2239        1.044  ...       0.9374        1.109       -0.9751       -0.8703       -0.8966       -1.0402
10                                  EBT   824770000.0   657534000.0   676950000.0   70666000.0  327138000.0  ...  210801000.0  200610000.0  -174870000.0  -176096000.0  -180392000.0  -208957000.0
11           Operating Cash Flow Margin        0.6812        0.5384        0.3156       0.3525       0.4927  ...       0.8941       0.0651       -1.8894       -2.5336        -2.535       -6.8918
12                           EBT margin         0.541         0.434         0.479       0.0744       0.3475  ...       0.3742       0.3043       -1.5283       -2.3906       -2.7531       -8.7816
13                          EBIT Margin         0.471        0.4754        0.3902       0.1046       0.2868  ...       0.3975       0.3272       -1.4498       -2.1707       -2.5431       -8.3878
14    Income from Continuous Operations   837270000.0   602753000.0   583234000.0   57518000.0  267427000.0  ...  188141000.0  228452000.0  -199318000.0  -176096000.0  -180392000.0  -208957000.0
15                         R&D Expenses   420928000.0   448528000.0   480011000.0  555948000.0  379091000.0  ...  186438000.0  189052000.0   173604000.0   158612000.0   168888000.0   170434000.0
16      Non-operating Interest Expenses    13871000.0    14136000.0    14249000.0   14548000.0   14837000.0  ...   11659000.0    7059000.0     6962000.0    12001000.0     7686000.0     3951000.0
17                        EBITDA Margin        0.4885        0.4931        0.4088       0.1318       0.3162  ...       0.4147        0.339       -1.3737       -2.0505       -2.4179       -8.1038
18         Non-operating Income/Expense   106737000.0   -62690000.0   125485000.0  -28667000.0   57178000.0  ...  -13101000.0  -15097000.0    -8980000.0   -16197000.0   -13758000.0    -9369000.0
19                          EPS (Basic)          3.22          2.32          2.26         0.22         1.04  ...         0.76         1.06         -0.85         -0.87          -0.9         -1.04
20                         Gross Margin         0.879        0.8927        0.8691       0.8611       0.8558  ...       0.9465       0.9417        0.9187        0.9638        0.9535        0.8643
21                              Revenue  1524485000.0  1515107000.0  1413265000.0  949828000.0  941293000.0  ...  563340000.0  659200000.0   114424000.0    73662000.0    65524000.0    23795000.0
22            Shares (Diluted, Average)   263403000.0   263515000.0   262108000.0  260473000.0  259822000.0  ...  217602000.0  219349000.0   204413000.0   202329000.0   201355000.0   200887000.0
23                      Cost of Revenue   184520000.0   162497000.0   185012000.0  131914000.0  135740000.0  ...   30127000.0   38406000.0     9306000.0     2666000.0     3049000.0     3228000.0
24                        SG&A Expenses   191804000.0   182258000.0   195277000.0  159674000.0  156502000.0  ...  121881000.0  110654000.0    96663000.0    71523000.0    62478000.0    48855000.0
25          EPS (Diluted, Consolidated)        3.1787        2.2874        2.2319       0.2208       1.0293  ...       1.0011       1.0415       -0.9751       -0.8703       -0.8966       -1.0402
26                       Revenue Growth        0.6196         0.765        0.6242       0.2107       0.2515  ...       7.5975      26.7033        2.6185        2.2842        0.9335       -0.0466
27             Shares (Basic, Weighted)   259637000.0   259815000.0   256728000.0  256946000.0  256154000.0  ...  204891000.0  206002000.0   204413000.0   202329000.0   200402000.0   200887000.0
28                     Income after Tax   837270000.0   602753000.0   583234000.0   57518000.0  267427000.0  ...  188141000.0  228452000.0  -199318000.0  -176096000.0  -180392000.0  -208957000.0
29                        EPS (Diluted)          3.18          2.29          2.23         0.22         1.03  ...         0.74         1.02         -0.85         -0.87          -0.9         -1.04
30                    Net Income Common   837270000.0   602753000.0   583234100.0   57518000.0  267427000.0  ...  158629000.0  221110000.0  -174069000.0  -176096000.0  -180392000.0  -208957000.0
31           Shares (Diluted, Weighted)   263403000.0   263515000.0   260673000.0  260473000.0  259822000.0  ...  208807000.0  219349000.0   204413000.0   202329000.0   200402000.0   200887000.0
32             Non-Controlling Interest           NaN           NaN           NaN          NaN          NaN  ...   29512000.0    7342000.0   -25249000.0           0.0           NaN           0.0
33                Dividends (Preferred)           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN
34   EPS (Basic, from Discontinued Ops)           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN
35        EPS (Diluted, from Disc. Ops)           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN
36  Income from Discontinued Operations           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN

[37 rows x 41 columns]

It also saves the same table to data.csv.


Or download their XLSX export from that page:

url = 'https://stockrow.com/api/companies/VRTX/financials.xlsx?dimension=Q&section=Income%20Statement&sort=desc'

df = pd.read_excel(url)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(df)
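
To see what the id-to-name mapping in the first snippet does without calling the API, here is a minimal offline sketch. The sample records, their ids, and the figures are made up, and `label_rows` is a hypothetical helper; it only assumes that both responses are lists of dicts keyed by `id`, as the first snippet implies. Setting `id` as the index leaves the dates as the only columns.

```python
import pandas as pd

# Hypothetical stand-ins for the two JSON responses; ids and figures are made up.
sample_indicators = [
    {"id": "a1", "name": "Revenue"},
    {"id": "b2", "name": "Gross Profit"},
]
sample_rows = [
    {"id": "a1", "2020-06-30": 100.0, "2020-03-31": 90.0},
    {"id": "b2", "2020-06-30": 80.0},  # one quarter missing on purpose
]


def label_rows(indicators, rows):
    """Return copies of rows with each id replaced by its indicator name.

    Unknown ids are kept as they are instead of raising KeyError.
    """
    names = {i["id"]: i["name"] for i in indicators}
    return [{**row, "id": names.get(row["id"], row["id"])} for row in rows]


# Indicator names become the index, so only the dates remain as columns.
df = pd.DataFrame(label_rows(sample_indicators, sample_rows)).set_index("id")
print(df)
```

Quarters missing from a record show up as NaN, which matches the NaN cells in the printed output above.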

Upvotes: 5

Alexandra Dudkina

Reputation: 4472

The first problem is that the table is loaded via JavaScript, so BeautifulSoup does not find it: the data is not yet in the HTML at the moment of parsing. To solve this you'll need to use Selenium.

The second problem is that there is no table tag in the HTML; the page uses a grid layout built from other elements.

Since you're using Google Colab, you'll need to install the Selenium web driver there (code taken from this answer):

!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# chromedriver was copied to /usr/bin above, so it is found on PATH
wd = webdriver.Chrome(options=chrome_options)

After that you can load the page and parse it:

from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# load page via selenium
wd.get("https://stockrow.com/VRTX/financials/income/quarterly")

# wait up to 5 seconds for the element with class mainGrid to appear
grid = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'mainGrid')))

# parse content of the grid
soup = BeautifulSoup(grid.get_attribute('innerHTML'), 'lxml')

# access grid cells, your logic should be here
for tag in soup.find_all('div', {'class': 'financials-value'}):
  print(tag)
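
As a rough idea of what the "your logic should be here" step could look like, here is a sketch that groups grid cells into rows. It runs on a small hand-written HTML string rather than the live page: only the `mainGrid` and `financials-value` class names come from the code above, while the `row` and `label` classes, the numbers, and the `grid_to_rows` helper are assumptions made up for illustration. The real page's markup has to be inspected and the selectors adjusted.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the "row" and "label" classes and all numbers are invented.
sample_html = """
<div class="mainGrid">
  <div class="row">
    <div class="label">Revenue</div>
    <div class="financials-value">1,524.49</div>
    <div class="financials-value">1,515.11</div>
  </div>
  <div class="row">
    <div class="label">Gross Profit</div>
    <div class="financials-value">1,339.97</div>
    <div class="financials-value">1,352.61</div>
  </div>
</div>
"""


def grid_to_rows(html):
    """Turn a div-based grid into a list of [label, value, value, ...] rows."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("div", class_="row"):
        label = row.find("div", class_="label").get_text(strip=True)
        # strip thousands separators before converting the cell text to float
        values = [
            float(cell.get_text(strip=True).replace(",", ""))
            for cell in row.find_all("div", class_="financials-value")
        ]
        rows.append([label] + values)
    return rows


for parsed in grid_to_rows(sample_html):
    print(parsed)
```

With Selenium, the same function could be given `grid.get_attribute('outerHTML')` once the selectors match the real page.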

Upvotes: 2
