How do I stop the 302 url redirection in a simple web crawl?

Question

I am trying to crawl a website using the Requests library in Python, and when I try:

>>> import requests
>>> r = requests.get('http://www.cell.com/cell-stem-cell/home', allow_redirects=False)
>>> r.status_code
302
>>> r.text
'The URL has moved here\n'

and when I try:

>>> r = requests.get("https://secure.jbs.elsevierhealth.com/action/getSharedSiteSession?redirect=http%3A%2F%2Fwww.cell.com%2Fcell-stem-cell%2Fhome&rc=0&code=cell-site")
>>>
>>> r.text
'
        Your browser doesn\'t support iFrames!
        Your browser doesn\'t support iFrames!
        Your browser doesn\'t support iFrames!
        Your browser doesn\'t support iFrames!
        Your browser doesn\'t support iFrames!
        Your browser doesn\'t support iFrames!
'

I just want to get the HTML of the original website.

Mani · Accepted Answer

You have to send a User-Agent header with the request so that the website believes the request is coming from a real web browser. If you want the content of the non-redirected URL, your code should be:

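from requests import get

# Browser-like User-Agent so the site treats the request as coming from a browser.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
# allow_redirects=False returns the 302 response itself instead of following it.
content = get('http://www.cell.com/cell-stem-cell/home', headers=headers, allow_redirects=False).text
print(content)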

The output will be:

The URL has moved here
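
The body of a 302 response is only a short notice; the address the server wants the client to visit is carried in the Location response header. A minimal sketch of reading it when redirects are not followed is shown here. The helper name redirect_target and the REDIRECT_CODES set are illustrative assumptions, not part of the Requests library.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch (an assumption of the example).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(response, request_url):
    """Return the absolute URL a redirect response points to, or None."""
    if response.status_code not in REDIRECT_CODES:
        return None
    location = response.headers.get("Location")
    if location is None:
        return None
    # Location may be relative, so resolve it against the requested URL.
    return urljoin(request_url, location)


if __name__ == "__main__":
    import requests

    url = "http://www.cell.com/cell-stem-cell/home"
    r = requests.get(url, allow_redirects=False)
    print(r.status_code, redirect_target(r, url))
```

With this you can see where the site is trying to send you before deciding whether to follow the redirect.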

If you want the content of the redirected URL, allow redirects but still include the User-Agent header. This works for most websites that don't rely on dynamic content. To crawl a site that renders its content dynamically, you have to use a browser automation tool such as Selenium.
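
The redirect-following variant might look like the sketch below. It reuses the User-Agent string from the answer; the function name fetch_html, the BROWSER_HEADERS constant, and the 10-second timeout are assumptions made for illustration. Whether a given site returns the real page this way depends on the site.

```python
# Browser-like User-Agent string, copied from the answer above.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    )
}


def fetch_html(url, session=None, timeout=10):
    """Fetch url with a browser-like User-Agent, following redirects.

    Returns (final_url, redirect_status_codes, html_text).
    """
    if session is None:
        # Imported here so the function can also be exercised with a stub session.
        import requests

        session = requests.Session()
    response = session.get(
        url,
        headers=BROWSER_HEADERS,
        allow_redirects=True,  # already the default for GET, spelled out for clarity
        timeout=timeout,
    )
    # response.history holds the intermediate redirect responses, oldest first.
    hops = [hop.status_code for hop in response.history]
    return response.url, hops, response.text


if __name__ == "__main__":
    final_url, hops, html = fetch_html("http://www.cell.com/cell-stem-cell/home")
    print(final_url, hops)
    print(html[:500])
```

Using a Session keeps any cookies set during the redirect chain, which some sites expect on the final request.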
