gdogg371

Reputation: 4152

Invalid URL when using Python Requests

I am trying to call the API that returns programme data on this page; it is requested when you scroll down and new tiles are displayed on the screen. Using Chrome DevTools I found the API call and put together the following Requests script:

import cloudscraper
import requests

session = requests.session()

url = 'https://ie.api.atom.nowtv.com/adapter-atlas/v3/query/node?slug=/entertainment/collections/all-entertainment&represent=(items[take=60](items(items[select_list=iceberg])))'

session.headers = {
'Host': 'https://www.nowtv.com',
'Connection': 'keep-alive',
'Accept': 'application/json, text/javascript, */*',
'X-Requested-With': 'XMLHttpRequest',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
'Referer': 'https://www.nowtv.com',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'
}

scraper = cloudscraper.create_scraper(sess=session)
r = scraper.get(url)

data = r.content
print(data)

session.close()

This returns only the following:

b'<HTML><HEAD>\n<TITLE>Invalid URL</TITLE>\n</HEAD><BODY>\n<H1>Invalid URL</H1>\nThe requested URL "&#91;no&#32;URL&#93;", is invalid.<p>\nReference&#32;&#35;9&#46;3c0f0317&#46;1608324989&#46;5902cff\n</BODY></HTML>\n'

I assume the issue is the part at the end of the URL that is in round and square brackets, but I am not sure how to handle these in a Requests call. Can anyone provide the correct syntax?

Thanks

Upvotes: 0

Views: 690

Answers (1)

alecxe

Reputation: 474151

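The issue is the Host session header value: don't set it. Host must be the bare hostname of the server being requested (here ie.api.atom.nowtv.com), not a full URL pointing at a different site, and it is filled in automatically from the request URL.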
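As a rough sketch of that fix (not the exact code I ran), the headers from the question can be reused with the Host entry filtered out. The names `question_headers`, `clean_headers` and `expected_host` are only illustrative, and only a subset of the question's headers is repeated here:

```python
from urllib.parse import urlsplit

url = 'https://ie.api.atom.nowtv.com/adapter-atlas/v3/query/node'

# A subset of the headers from the question, including the problematic Host.
question_headers = {
    'Host': 'https://www.nowtv.com',  # wrong: a full URL, and a different site
    'Connection': 'keep-alive',
    'Accept': 'application/json, text/javascript, */*',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://www.nowtv.com',
}

# Drop Host; header names are case-insensitive, so compare in lower case.
clean_headers = {
    name: value
    for name, value in question_headers.items()
    if name.lower() != 'host'
}

# The Host value the HTTP client derives by itself is the URL's network location.
expected_host = urlsplit(url).netloc
print(expected_host)
```

`clean_headers` can then be applied with `session.headers.update(clean_headers)` before calling `session.get(url)`.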


That should be enough, but I also did a few additional things:

  • add the X-* headers:

    session.headers.update({
        'X-SkyOTT-Proposition': 'NOWTV',
        'X-SkyOTT-Language': 'en',
        'X-SkyOTT-Platform': 'PC',
        'X-SkyOTT-Territory': 'GB',
        'X-SkyOTT-Device': 'COMPUTER'
    })
    
  • visit the main page without the X-Requested-With (XHR) header set and with a broader Accept header value:

    text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8 
    
  • I've also passed the GET parameters via params. I don't think you have to do it; it's just cleaner:

     In [33]: url = 'https://ie.api.atom.nowtv.com/adapter-atlas/v3/query/node'
    
     In [34]: response = session.get(url, params={
                  'slug': '/entertainment/collections/all-entertainment', 
                  'represent': '(items[take=60,skip=2340](items(items[select_list=iceberg])))'
              }, headers={
                  'Accept': 'application/json, text/plain, */*', 
                  'X-Requested-With':'XMLHttpRequest'
              })
    
     In [35]: response
     Out[35]: <Response [200]>
    
     In [36]: response.text
     Out[36]: '{"links":{"self":"/adapter-atlas/v3/query/node/e5b0e516-2b84-11e9-b860-83982be1b6a6"},"id":"e5b0e516-2b84-11e9-b860-83982be1b6a6","type":"CATALOGUE/COLLECTION","segmentId":"","segmentName":"default","childTypes":{"next_items":{"nodeTypes":["ASSET/PROGRAMME","CATALOGUE/SERIES"],"count":68},"items":{"nodeTypes":["ASSET/PROGRAMME","CATALOGUE/SERIES"],"count":2376},"curation-config":{"nodeTypes":["CATALOGUE/CURATIONCONFIG"],"count":1}},"attributes":{"childNodeTyp
               ...
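To see why the brackets in the query string are not a problem, here is a small offline sketch using only the standard library. It assumes Requests percent-encodes `params` values in essentially the same way as `urllib.parse.urlencode`; `represent_for` is a hypothetical helper, and treating `skip` as an item offset is a guess based on the request above rather than documented API behaviour:

```python
from urllib.parse import urlencode, parse_qs


def represent_for(skip, take=60):
    # Hypothetical helper: builds the 'represent' value for one page,
    # assuming 'skip' is the number of items to jump over.
    return f'(items[take={take},skip={skip}](items(items[select_list=iceberg])))'


params = {
    'slug': '/entertainment/collections/all-entertainment',
    'represent': represent_for(2340),
}

# Brackets, slashes, commas and '=' inside the values are percent-encoded.
query = urlencode(params)
print(query)

# Decoding the query string gives back exactly the original values.
assert parse_qs(query) == {key: [value] for key, value in params.items()}
```

Passing the same `params` dict to `session.get(url, params=params)` leaves the encoding to the library, so the brackets never have to be escaped by hand.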
    

Upvotes: 1
