ItM

Reputation: 331

Getting full html back from a website request using Python

I'm trying to send an HTTP request to a website (for example, Digikey) and read back the full HTML. I'm using this link: https://www.digikey.com/products/en?keywords=part_number with a specific part number, such as: https://www.digikey.com/products/en?keywords=511-8002-KIT. However, what I get back is not the full HTML.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.digikey.com/products/en?keywords=511-8002-KIT')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())

Output:

<!DOCTYPE html>
<html>
 <head>
  <script>
   var i10cdone =(function(){ function pingBeacon(msg){ var i10cimg = document.createElement('script'); i10cimg.src='/i10c@p1/botox/file/nv-loaded.js?status='+window.encodeURIComponent(msg); i10cimg.onload = function(){ (document.head || document.documentElement).removeChild(i10cimg) }; i10cimg.onerror = function(){ (document.head || document.documentElement).removeChild(i10cimg) }; ( document.head || document.documentElement).appendChild(i10cimg) }; pingBeacon('loaded'); if(String(document.cookie).indexOf('i10c.bdddb=c2-f0103ZLNqAeI3BH6yYOfG7TZlRtCrMwqUo')>=0) { document.cookie = 'i10c.bdddb=;path=/';}; var error=''; function errorHandler(e) { if (e && e.error && e.error.stack ) { error=e.error.stack; } else if( e && e.message ) { error = e.message; } else { error = 'unknown';}} if(window.addEventListener) { window.addEventListener('error',errorHandler, false); } else { if ( window.attachEvent ){ window.attachEvent('onerror',errorHandler); }} return function(){ if (window.removeEventListener) {window.removeEventListener('error',errorHandler); } else { if (window.detachEvent) { window.detachEvent('onerror',errorHandler); }} if(error) { pingBeacon('error-' + String(error).substring(0,500)); document.cookie='i10c.bdddb=c2-f0103ZLNqAeI3BH6yYOfG7TZlRtCrMwqUo;path=/'; }}; })();
  </script>
  <script src="/i10c@p1/client/latest/auto/instart.js?i10c.nv.bucket=pci&amp;i10c.nv.host=www.digikey.com&amp;i10c.opts=botox&amp;bcb=1" type="text/javascript">
  </script>
  <script type="text/javascript">
   INSTART.Init({"apiDomain":"assets.insnw.net","correlation_id":"1553546232:4907a9bdc85fe4e8","custName":"digikey","devJsExtraFlags":"{\"disableQuerySelectorInterception\" :true,  'rumDataConfigKey':'/instartlogic/clientdatacollector/getconfig/monitorprod.json','custName':'digikey','propName':'northamerica'}","disableInjectionXhr":true,"disableInjectionXhrQueryParam":"instart_disable_injection","iframeCommunicationTimeout":3000,"nanovisorGlobalNameSpace":"I10C","partialImage":false,"propName":"northamerica","rId":"0","release":"latest","rum":false,"serveNanovisorSameDomain":true,"third_party":["IA://www.digikey.com/js/geotargeting.js"],"useIframeRpc":false,"useWrapper":false,"ver":"auto","virtualDomains":4,"virtualizeDomains":["^auth\\.digikey\\.com$","^authtest\\.digikey\\.com$","^blocked\\.digikey\\.com$","^dynatrace\\.digikey\\.com$","^search\\.digikey\\.com$","^www\\.digikey\\.ca$","^www\\.digikey\\.com$","^www\\.digikey\\.com\\.mx$"]}
);
  </script>
  <script>
   typeof i10cdone === 'function' && i10cdone();
  </script>
 </head>
 <body>
  <script>
   setTimeout(function(){document.cookie="i10c.eac23=1";window.location.reload(true);},30);
  </script>
 </body>
</html>

The reason I need the full HTML is to search it for specific keywords, such as whether the terms "Lead free" or "Through hole" appear in the result for a particular part number. I'm not doing this only for Digikey, but for other sites as well.
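For reference, this is roughly what I plan to do once I have the rendered HTML (just a sketch of the keyword check):

# Once the full, rendered page is available, the check itself is simple:
page_text = soup.get_text()
print("Lead free" in page_text)
print("Through hole" in page_text)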

Any help would be appreciated!

Thanks!

EDIT:

Thank you all for your suggestions/answers. More info here for others who are interested in this: Web-scraping JavaScript page with Python

Upvotes: 3

Views: 1383

Answers (2)

NightShade

Reputation: 510

The issue is that the page's JavaScript never gets a chance to run, and therefore never populates the HTML elements you need. One solution to this would be to use a webdriver via Selenium:

from selenium import webdriver

# Launch a real Chrome browser (requires chromedriver on your PATH)
chrome = webdriver.Chrome()
chrome.get("https://www.digikey.com/products/en?keywords=511-8002-KIT")

# page_source contains the DOM after the JavaScript has executed
source = chrome.page_source

Often this is a lot less efficient, since you have to wait for the page to fully load. One way to get around this would be to look for APIs that the website provides to access the data you want directly; I would recommend doing some research into what those might be.
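If you do stay with Selenium, an explicit wait at least avoids sleeping for a fixed time. A minimal sketch (the CSS selector below is a placeholder, not a real Digi-Key element id; adapt it to the elements you actually need):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome = webdriver.Chrome()
chrome.get("https://www.digikey.com/products/en?keywords=511-8002-KIT")

# Wait up to 10 seconds for the part-detail content to appear, then grab the HTML.
# "#product-overview" is a placeholder selector, not a real Digi-Key element id.
WebDriverWait(chrome, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#product-overview"))
)
source = chrome.page_source
chrome.quit()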

Here is one potential API you could use to get the data directly:

https://api-portal.digikey.com/product

Upvotes: 1

gtalarico

Reputation: 4689

Most likely the parts of the page you are looking for include content that is generated dynamically using JavaScript.

Visit view-source:https://www.digikey.com/products/en?keywords=part_number in your browser and you will see that requests is fetching that same HTML - it's just not executing the JavaScript code.

If you right-click the page and click Inspect (in Chrome), you will see the final DOM that is created after the JavaScript code is executed.

To get the rendered content, you would need to use a full web driver like Selenium that is capable of executing the JavaScript to render the full page.

Here is an example of how to achieve that using Selenium:

How can I parse a website using Selenium and Beautifulsoup in python?

from bs4 import BeautifulSoup
from selenium import webdriver

# Launch Firefox (requires geckodriver), load the page, and grab the rendered HTML
driver = webdriver.Firefox()
driver.get('http://news.ycombinator.com')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all('title'):
    print(tag.text)

Output:

Hacker News
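If you don't want a browser window popping up, Selenium can also run Firefox headless. A minimal sketch (assumes geckodriver is installed and on your PATH):

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("--headless")  # run Firefox without opening a window

driver = webdriver.Firefox(options=options)
driver.get('https://www.digikey.com/products/en?keywords=511-8002-KIT')
html = driver.page_source
driver.quit()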

Upvotes: 2
