user7606063

Python scraping of dynamic content (rendered page differs from the HTML source code)

I'm a big fan of Stack Overflow and typically find solutions to my problems through this website. However, the following problem has bothered me for so long that I finally created an account to ask directly:

I'm trying to scrape this page: https://permid.org/1-21475776041. What I want are the rows "TRCS Asset Class" and "Currency".

For starters, I'm using this code:

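from bs4 import BeautifulSoup
import urllib2

url = 'https://permid.org/1-21475776041'

# Download the raw HTML exactly as the server sends it (no JavaScript is run)
req = urllib2.urlopen(url)
raw = req.read()
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(raw, 'html.parser')
print soup.prettify()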

The HTML returned (see below) is different from what you see in your browser when you open the link:

<!DOCTYPE html>
<!--[if lt IE 7]>      <html ng-app="tmsMdaasApp" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html ng-app="tmsMdaasApp" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html ng-app="tmsMdaasApp" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" ng-app="tmsMdaasApp">
 <!--<![endif]-->
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta charset="utf-8"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <meta content="max-age=0,no-cache" http-equiv="Cache-Control"/>
  <base href="/"/>
  <title ng-bind="PageTitle">
   Thomson Reuters | PermID
  </title>
  <meta content="" name="description"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="#ff8000" name="theme-color"/>
  <!-- Place favicon.ico and apple-touch-icon.png in the root directory -->
  <link href="app/vendor.daf96efe.css" rel="stylesheet"/>
  <link href="app/app.1405210f.css" rel="stylesheet"/>
  <link href="favicon.ico" rel="icon"/>
  <!-- Typekit -->
  <script src="//use.typekit.net/gnw2rmh.js">
  </script>
  <script>
   try{Typekit.load({async:true});}catch(e){}
  </script>
  <!-- // Typekit -->
  <!-- Google Tag Manager Data Layer -->
  <!--<script>
      analyticsEvent = function() {};
      analyticsSocial = function() {};
      analyticsForm = function() {};
      dataLayer = [];
    </script>-->
  <!-- // Google Tag Manager Data Layer -->
 </head>
 <body class="theme-grey" id="top" ng-esc="">
  <!--[if lt IE 7]>
      <p class="browserupgrade">You are using an <strong>outdated</strong> browser. Please <a href="http://browsehappy.com/">upgrade your browser</a> to improve your experience.</p>
    <![endif]-->
  <!-- Add your site or application content here -->
  <navbar class="tms-navbar">
  </navbar>
  <div id="body" role="main" ui-view="">
  </div>
  <div id="footer-wrapper" ng-show="!params.elementsToHide">
   <footer id="main-footer">
   </footer>
  </div>
  <!--[if lt IE 9]>
    <script src="bower_components/es5-shim/es5-shim.js"></script>
    <script src="bower_components/json3/lib/json3.min.js"></script>
    <![endif]-->
  <script src="app/vendor.8cc12370.js">
  </script>
  <script src="app/app.6e5f6ce8.js">
  </script>
 </body>
</html>

Does anyone know what I'm missing here and how I could get it to work?

Upvotes: 2

Views: 1816

Answers (2)

user7606063

Thanks, Teemu Risikko. A comment on the page you linked (albeit not the solution itself) put me on the right path.

In case someone else runs into the same problem, here is my solution: instead of traditional "scraping" (e.g. BeautifulSoup or lxml), I fetch the data directly from the API that the page itself calls, using the requests library.

  1. Navigate to the website using Google Chrome.
  2. Right-click on the website and select "Inspect".
  3. On the top navigation bar select "Network".
  4. Limit network monitor to "XHR".
  5. One of the entries (marked with an arrow) shows the URL that can be used with the requests library.

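import requests

# Placeholder: replace with your own access token
YOUR_ACCESS_TOKEN = 'your-access-token'

# URL copied from the XHR entry in Chrome's Network tab
url = 'https://permid.org/api/mdaas/getEntityById/21475776041'
headers = {'X-AG-Access-Token': YOUR_ACCESS_TOKEN}
r = requests.get(url, headers=headers)
r.json()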

Which gets me this:

{u'Asset Class': [u'Units'],
 u'Asset Class URL': [u'https://permid.org/1-302043'],
 u'Currency': [u'CAD'],
 u'Currency URL': [u'https://permid.org/1-500140'],
 u'Exchange': [u'TOR'],
 u'IsQuoteOf.mdaas': [{u'Is Quote Of': [u'Convertible Debentures Income Units'],
   u'URL': [u'https://permid.org/1-21475768667'],
   u'quoteOfInstrument': [u'21475768667'],
   u'quoteOfInstrument URL': [u'https://permid.org/1-21475768667']}],
 u'Mic': [u'XTSE'],
 u'PERM ID': [u'21475776041'],
 u'Quote Name': [u'CONVERTIBLE DEBENTURES INCOME UNT'],
 u'Quote Type': [u'equity'],
 u'RIC': [u'OCV_u.TO'],
 u'Ticker': [u'OCV.UN'],
 u'entityType': [u'Quote']}
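
To pull out just the two values asked for in the question, something like the following should work. It is a minimal sketch in Python 3 syntax: the helper name first_values is made up for illustration, the sample dict is trimmed from the response above, and I am assuming the "TRCS Asset Class" row on the page corresponds to the 'Asset Class' key in the JSON.

```python
# Sample trimmed from the JSON response shown above.
entity = {
    'Asset Class': ['Units'],
    'Currency': ['CAD'],
    'Exchange': ['TOR'],
    'PERM ID': ['21475776041'],
}


def first_values(entity, keys):
    """Return {key: first list item} for each requested key.

    Every value in the response is a list, so take its first element.
    Keys that are missing or hold an empty list map to None.
    (Hypothetical helper, not part of any library.)
    """
    result = {}
    for key in keys:
        values = entity.get(key) or []
        result[key] = values[0] if values else None
    return result


wanted = first_values(entity, ['Asset Class', 'Currency'])
print(wanted)
```

With a live response you would pass r.json() instead of the sample dict.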

Upvotes: 4

B.Adler

Reputation: 1539

With the default user-agent, a lot of sites will serve you a different-looking page, because they treat that user-agent as an outdated browser. This is what your output is telling you.

Reference on Changing user-agents
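
As a rough sketch of what changing the user-agent looks like, here is a Python 3 standard-library version (the question uses Python 2's urllib2, where Request takes the same headers argument). The helper name build_request and the user-agent string are placeholders I made up; the request is only built here, not sent.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Create (but do not send) a GET request with a custom User-Agent.

    Hypothetical helper for illustration only.
    """
    return Request(url, headers={'User-Agent': user_agent})


req = build_request(
    'https://permid.org/1-21475776041',
    'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0',  # placeholder UA
)

# urllib stores header names capitalized, hence 'User-agent' here.
print(req.get_header('User-agent'))
```

Passing req to urllib.request.urlopen would actually send it.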

Though this may be your problem, it does not exactly answer the question of how to get content that is applied to a page dynamically. To get the dynamically loaded data, you need to emulate the requests that the page's JavaScript makes on load. If you make the same requests the JavaScript makes, you will get the same data the JavaScript gets.
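
For this particular page, the request to emulate appears to be the API endpoint reported in the other answer. A minimal sketch (Python 3) of mapping the page URL to that endpoint follows; the helper name is made up, and the rule "drop the leading '1-' from the ID" is an assumption inferred from that single example, not documented behaviour.

```python
# Endpoint prefix as reported in the other answer in this thread.
API_BASE = 'https://permid.org/api/mdaas/getEntityById/'


def api_url_from_page_url(page_url):
    """Map a page URL to the XHR endpoint the page seems to call.

    Assumes the entity ID is the last path segment and that the API
    expects it without the '1-' prefix (inferred from one example).
    """
    entity_id = page_url.rstrip('/').rsplit('/', 1)[-1]
    if entity_id.startswith('1-'):
        entity_id = entity_id[2:]
    return API_BASE + entity_id


api_url = api_url_from_page_url('https://permid.org/1-21475776041')
print(api_url)
```

The resulting URL is the one to request (with whatever headers the browser sends, such as an access token) in place of the page URL.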

Upvotes: 0
