Alonsormm

Reputation: 33

How do I get the HTML content from a redirecting website with protection using BeautifulSoup?

I am having problems trying to get the HTML content of a web page.

On this website: https://tmofans.com/library/manga/5763/nisekoi, when you click the play icon of, for example, "Capitulo 230.00", it opens the link https://tmofans.com/goto/347231, which redirects you to this website: https://tmofans.com/viewer/5c187dcea0240/paginated

The problem is that when you open the link https://tmofans.com/goto/347231 directly, the page returns a 403 Forbidden message. The only way to be redirected to the final page is by clicking the play button on the first page.

I want to get the content of the final URL using only the tmofans.com/goto link.

I tried to get the HTML content using requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://tmofans.com/goto/347231")
page = str(BeautifulSoup(response.content, 'html.parser'))

print(page)

When I do this with https://tmofans.com/goto/347231, I only get the content of the 403 Forbidden page.

Upvotes: 2

Views: 648

Answers (2)

I once managed to scrape some protected pages using http.client and my browser.

I first navigated to the page I needed access to, then, using the browser's developer tools, I copied the request headers and used them in my script. That way your script accesses the resources the same way your browser does.

These two functions can help you: the first parses a raw HTTP request to get the headers (the request line and body may also be helpful, depending on your case), and the second uses them to download the file.

This may need some tweaking on your side to work.

import re
from http.client import HTTPSConnection

def parse_headers(http_post):
    """Converts a raw HTTP request string into its request line, headers and body."""

    # Regexes to extract the request line and the header fields
    req_line = re.compile(r'(?P<method>GET|POST)\s+(?P<resource>.+?)\s+(?P<version>HTTP/1\.1)')
    field_line = re.compile(r'\s*(?P<key>.+\S)\s*:\s+(?P<value>.+\S)\s*')

    first_line_end = http_post.find('\n')
    headers_end = http_post.find('\n\n')
    request = req_line.match(http_post[:first_line_end]).groupdict()
    headers = dict(field_line.findall(http_post[first_line_end:headers_end]))
    body = http_post[headers_end + 2:]

    return request, headers, body


def get_file(url, domain, headers, temp_directory):
    """
    Fetches the file located at the provided URL and returns its content.
    Uses `headers` to bypass auth.
    """
    conn = HTTPSConnection(domain)
    conn.request('GET', url, headers=headers)
    response = conn.getresponse()
    content_type = response.getheader('Content-Type')
    # Not used here, but handy if you need the original filename
    content_disp = response.getheader('Content-Disposition')

    # Change to whatever content type you need
    if content_type != 'application/pdf':
        conn.close()
        return None
    else:
        file_content = response.read()
        conn.close()
        return file_content

The headers string should look like this:

GET /fr/backend/XXXXXXXXX/845080 HTTP/1.1
Cookie: cookie_law_consented=true; landing_page=0; _ga=GA1.2.1218703015.1546948765; _gid=GA1.2.580320014.1546948765; _jt=1.735724042.1546948764; SID=5c485bfa-3f2c-425e-a2dd-32dd800e0bb3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: br, gzip, deflate
Host: XXXXX
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.2 Safari/605.1.15
Accept-Language: fr-fr
Referer: XXXXX
Connection: keep-alive

It may change depending on the website, but using these headers allowed me to download files behind a login.
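
For example, assuming the raw request above is stored in a string (called raw_request here purely for illustration), the two functions can be chained like this; a minimal sketch, with the output filename as a placeholder:

request, headers, body = parse_headers(raw_request)

# The resource path comes from the parsed request line; http.client
# will use the Host header we copied instead of adding its own.
file_content = get_file(request['resource'], headers['Host'], headers, None)

if file_content is not None:
    with open('downloaded.pdf', 'wb') as f:
        f.write(file_content)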

Upvotes: 0

Bitto

Reputation: 8225

This website checks whether the request has a Referer from their own site and returns a 403 response otherwise. You can easily bypass this by setting a Referer header.

import requests

ref = 'https://tmofans.com'
headers = {'Referer': ref}
r = requests.get('https://tmofans.com/goto/347231', headers=headers)
print(r.url)
print(r.status_code)

Output

https://tmofans.com/viewer/5c187dcea0240/paginated
200
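
Since requests follows the redirect for you, r.content already holds the HTML of the final viewer page, so you can feed it straight into BeautifulSoup; a minimal sketch building on the request above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'html.parser')
print(soup.title)  # e.g. inspect the final page's <title>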

Upvotes: 2
