Reputation: 71
I tried to scrape a table from this website: https://stockrow.com/VRTX/financials/income/quarterly
I am using Python in Google Colab and I'd like to have the dates as columns (e.g. 2020-06-30). I used code like this:
import urllib.request
import bs4 as bs
source = urllib.request.urlopen('https://stockrow.com/VRTX/financials/income/quarterly').read()
soup = bs.BeautifulSoup(source,'lxml')
table = soup.find_all('table')
However, I cannot get the tables. I am fairly new to scraping, so I looked at other Stack Overflow questions but couldn't solve the problem. Can you please help me? That would be much appreciated.
Upvotes: 0
Views: 423
Reputation: 195553
You can use their API to load the data:
import requests
import pandas as pd
indicators_url = 'https://stockrow.com/api/indicators.json'
data_url = 'https://stockrow.com/api/companies/VRTX/financials.json?ticker=VRTX&dimension=Q&section=Income+Statement'
indicators = {i['id']: i for i in requests.get(indicators_url).json()}
all_data = []
for d in requests.get(data_url).json():
    # replace the indicator id with its human-readable name
    d['id'] = indicators[d['id']]['name']
    all_data.append(d)
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Prints:
id 2020-06-30 2020-03-31 2019-12-31 2019-09-30 2019-06-30 ... 2011-12-31 2011-09-30 2011-06-30 2011-03-31 2010-12-31 2010-09-30
0 Consolidated Net Income/Loss 837270000.0 602753000.0 583234100.0 57518000.0 267427000.0 ... 188141000.0 228452000.0 -199318000.0 -176096000.0 -180392000.0 -208957000.0
1 EPS (Basic, from Continuous Ops) 3.2248 2.3199 2.2654 0.2239 1.044 ... 0.9374 1.109 -0.9751 -0.8703 -0.8966 -1.0402
2 Net Profit Margin 0.5492 0.3978 0.4127 0.0606 0.2841 ... 0.2816 0.3354 -1.5213 -2.3906 -2.7531 -8.7816
3 Gross Profit 1339965000.0 1352610000.0 1228253000.0 817914000.0 805553000.0 ... 533213000.0 620794000.0 105118000.0 70996000.0 62475000.0 20567000.0
4 Income Tax Provision -12500000.0 54781000.0 93716000.0 13148000.0 59711000.0 ... 22660000.0 -27842000.0 24448000.0 0.0 NaN 0.0
5 Operating Income 718033000.0 720224100.0 551464400.0 99333000.0 269960000.0 ... 223901900.0 215707000.0 -165890000.0 -159899000.0 -166634000.0 -199588000.0
6 EBIT 718033000.0 720224100.0 551464700.0 99333000.0 269960000.0 ... 223901900.0 215707000.0 -165890000.0 -159899000.0 -166634000.0 -199588000.0
7 EPS (Diluted, from Cont. Ops) 3.1787 2.2874 2.2319 0.2208 1.0293 ... 1.0011 1.0415 -0.9751 -0.8703 -0.8966 -1.0402
8 EBITDA 744730000.0 747045000.0 577720400.0 125180000.0 297658000.0 ... 233625900.0 223457000.0 -157181000.0 -151041000.0 -158429000.0 -192830000.0
9 EPS (Basic, Consolidated) 3.2248 2.3199 2.2654 0.2239 1.044 ... 0.9374 1.109 -0.9751 -0.8703 -0.8966 -1.0402
10 EBT 824770000.0 657534000.0 676950000.0 70666000.0 327138000.0 ... 210801000.0 200610000.0 -174870000.0 -176096000.0 -180392000.0 -208957000.0
11 Operating Cash Flow Margin 0.6812 0.5384 0.3156 0.3525 0.4927 ... 0.8941 0.0651 -1.8894 -2.5336 -2.535 -6.8918
12 EBT margin 0.541 0.434 0.479 0.0744 0.3475 ... 0.3742 0.3043 -1.5283 -2.3906 -2.7531 -8.7816
13 EBIT Margin 0.471 0.4754 0.3902 0.1046 0.2868 ... 0.3975 0.3272 -1.4498 -2.1707 -2.5431 -8.3878
14 Income from Continuous Operations 837270000.0 602753000.0 583234000.0 57518000.0 267427000.0 ... 188141000.0 228452000.0 -199318000.0 -176096000.0 -180392000.0 -208957000.0
15 R&D Expenses 420928000.0 448528000.0 480011000.0 555948000.0 379091000.0 ... 186438000.0 189052000.0 173604000.0 158612000.0 168888000.0 170434000.0
16 Non-operating Interest Expenses 13871000.0 14136000.0 14249000.0 14548000.0 14837000.0 ... 11659000.0 7059000.0 6962000.0 12001000.0 7686000.0 3951000.0
17 EBITDA Margin 0.4885 0.4931 0.4088 0.1318 0.3162 ... 0.4147 0.339 -1.3737 -2.0505 -2.4179 -8.1038
18 Non-operating Income/Expense 106737000.0 -62690000.0 125485000.0 -28667000.0 57178000.0 ... -13101000.0 -15097000.0 -8980000.0 -16197000.0 -13758000.0 -9369000.0
19 EPS (Basic) 3.22 2.32 2.26 0.22 1.04 ... 0.76 1.06 -0.85 -0.87 -0.9 -1.04
20 Gross Margin 0.879 0.8927 0.8691 0.8611 0.8558 ... 0.9465 0.9417 0.9187 0.9638 0.9535 0.8643
21 Revenue 1524485000.0 1515107000.0 1413265000.0 949828000.0 941293000.0 ... 563340000.0 659200000.0 114424000.0 73662000.0 65524000.0 23795000.0
22 Shares (Diluted, Average) 263403000.0 263515000.0 262108000.0 260473000.0 259822000.0 ... 217602000.0 219349000.0 204413000.0 202329000.0 201355000.0 200887000.0
23 Cost of Revenue 184520000.0 162497000.0 185012000.0 131914000.0 135740000.0 ... 30127000.0 38406000.0 9306000.0 2666000.0 3049000.0 3228000.0
24 SG&A Expenses 191804000.0 182258000.0 195277000.0 159674000.0 156502000.0 ... 121881000.0 110654000.0 96663000.0 71523000.0 62478000.0 48855000.0
25 EPS (Diluted, Consolidated) 3.1787 2.2874 2.2319 0.2208 1.0293 ... 1.0011 1.0415 -0.9751 -0.8703 -0.8966 -1.0402
26 Revenue Growth 0.6196 0.765 0.6242 0.2107 0.2515 ... 7.5975 26.7033 2.6185 2.2842 0.9335 -0.0466
27 Shares (Basic, Weighted) 259637000.0 259815000.0 256728000.0 256946000.0 256154000.0 ... 204891000.0 206002000.0 204413000.0 202329000.0 200402000.0 200887000.0
28 Income after Tax 837270000.0 602753000.0 583234000.0 57518000.0 267427000.0 ... 188141000.0 228452000.0 -199318000.0 -176096000.0 -180392000.0 -208957000.0
29 EPS (Diluted) 3.18 2.29 2.23 0.22 1.03 ... 0.74 1.02 -0.85 -0.87 -0.9 -1.04
30 Net Income Common 837270000.0 602753000.0 583234100.0 57518000.0 267427000.0 ... 158629000.0 221110000.0 -174069000.0 -176096000.0 -180392000.0 -208957000.0
31 Shares (Diluted, Weighted) 263403000.0 263515000.0 260673000.0 260473000.0 259822000.0 ... 208807000.0 219349000.0 204413000.0 202329000.0 200402000.0 200887000.0
32 Non-Controlling Interest NaN NaN NaN NaN NaN ... 29512000.0 7342000.0 -25249000.0 0.0 NaN 0.0
33 Dividends (Preferred) NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
34 EPS (Basic, from Discontinued Ops) NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
35 EPS (Diluted, from Disc. Ops) NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
36 Income from Discontinued Operations NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
[37 rows x 41 columns]
And saves data.csv.
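To see how the two payloads fit together without hitting the network, here is a minimal sketch on made-up sample data. The contents of `sample_indicators` and `sample_rows` (including the ids) are assumptions modeled on the code above, not captured API output, and `build_frame` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical stand-ins for the two JSON payloads; the real API's ids differ.
sample_indicators = [
    {"id": "abc-1", "name": "Revenue"},
    {"id": "abc-2", "name": "Gross Profit"},
]
sample_rows = [
    {"id": "abc-1", "2020-06-30": 1524485000.0, "2020-03-31": 1515107000.0},
    {"id": "abc-2", "2020-06-30": 1339965000.0},  # a missing quarter becomes NaN
]


def build_frame(indicators, rows):
    """Replace indicator ids with names and return one row per indicator."""
    names = {i["id"]: i["name"] for i in indicators}
    # Fall back to the raw id if an indicator is missing from the lookup.
    renamed = [{**row, "id": names.get(row["id"], row["id"])} for row in rows]
    return pd.DataFrame(renamed).set_index("id")


df = build_frame(sample_indicators, sample_rows)
print(df)
```

Using `.get` with a fallback keeps an id that is absent from the indicator list instead of raising `KeyError`, and `set_index("id")` leaves only the date columns as data columns.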
Or download their XLSX from that page:
url = 'https://stockrow.com/api/companies/VRTX/financials.xlsx?dimension=Q&section=Income%20Statement&sort=desc'
df = pd.read_excel(url)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(df)
Upvotes: 5
Reputation: 4472
The first problem is that the table is loaded via JavaScript, so BeautifulSoup does not find it: the HTML you download does not contain it yet at the moment of parsing. To solve this, you'll need to use Selenium.
The second problem is that there is no table tag in the HTML; the page uses grid formatting.
Since you're using Google Colab, you'll need to install the Selenium web driver there (code taken from this answer):
!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
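# chromedriver was copied to /usr/bin above, so it is found on PATH;
# recent Selenium releases take options= instead of chrome_options=
wd = webdriver.Chrome(options=chrome_options)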
After that you can load the page and parse it:
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
# load page via selenium
wd.get("https://stockrow.com/VRTX/financials/income/quarterly")
# wait up to 5 seconds for the element with class mainGrid to be present
grid = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'mainGrid')))
# parse content of the grid
soup = BeautifulSoup(grid.get_attribute('innerHTML'), 'lxml')
# access grid cells, your logic should be here
for tag in soup.find_all('div', {'class': 'financials-value'}):
    print(tag)
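As one possible way to fill in "your logic", the sketch below turns the cell text into numbers. It runs on a hand-written HTML snippet rather than the live page: only the class names `mainGrid` and `financials-value` come from the code above, while the markup layout, the comma-separated number format, and the helper name `cell_to_float` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names come from the page.
sample_html = """
<div class="mainGrid">
  <div class="financials-value">1,524,485</div>
  <div class="financials-value">1,515,107</div>
  <div class="financials-value">-12.5</div>
  <div class="financials-value"></div>
</div>
"""


def cell_to_float(text):
    """Convert a cell's text to float; return None for empty or non-numeric text."""
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


soup = BeautifulSoup(sample_html, "html.parser")
print(soup.find_all("table"))  # empty list: this grid has no <table> tag
cells = soup.find_all("div", {"class": "financials-value"})
values = [cell_to_float(c.get_text()) for c in cells]
print(values)
```

On the real page the cell text may be formatted differently (units, parentheses for negatives), so the cleaning step would need adjusting after inspecting the actual output of the loop above.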
Upvotes: 2