Reputation: 33
I'm having problems getting the HTML content of a web page.
On this website: https://tmofans.com/library/manga/5763/nisekoi, when you click the play icon, for example on "Capitulo 230.00", it opens this link: https://tmofans.com/goto/347231, which redirects you to https://tmofans.com/viewer/5c187dcea0240/paginated.
The problem is that when you open https://tmofans.com/goto/347231 directly, the page gives a 403 Forbidden response. The only way to be redirected to the final page is by clicking the play button on the first page.
I want to get the final URL's content using only the tmofans.com/goto link.
I tried to get the HTML content using requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://tmofans.com/goto/347231")
page = str(BeautifulSoup(response.content, 'html.parser'))
print(page)
When I do this with https://tmofans.com/goto/347231, I only get the content of the 403 Forbidden page.
Upvotes: 2
Views: 648
Reputation: 383
I once managed to scrape some protected pages using http.client and my browser.
I first navigated to the page I needed access to, then, using the browser's developer tools, I copied the request headers and used them in my script. That way your script accesses the resources the same way your browser does.
These two methods can help you: the first parses the raw HTTP request to get the headers (the request line and body may also be helpful depending on your case), and the second uses them to download the file.
This may need some tweaking on your side to work.
import re
from http.client import HTTPSConnection

def parse_headers(http_post):
    """Converts a raw request string to its request line, headers and body."""
    # Regexes to extract the request line and the header fields
    req_line = re.compile(r'(?P<method>GET|POST)\s+(?P<resource>.+?)\s+(?P<version>HTTP/1.1)')
    field_line = re.compile(r'\s*(?P<key>.+\S)\s*:\s+(?P<value>.+\S)\s*')

    first_line_end = http_post.find('\n')
    headers_end = http_post.find('\n\n')

    request = req_line.match(http_post[:first_line_end]).groupdict()
    headers = dict(field_line.findall(http_post[first_line_end:headers_end]))
    body = http_post[headers_end + 2:]

    return request, headers, body

def get_file(url, domain, headers, temp_directory):
    """
    Fetches the file located at the provided URL and returns its content.
    Uses `headers` to bypass auth.
    """
    conn = HTTPSConnection(domain)
    conn.request('GET', url, headers=headers)
    response = conn.getresponse()

    content_type = response.getheader('Content-Type')
    content_disp = response.getheader('Content-Disposition')

    # Change to whatever content type you need
    if content_type != 'application/pdf':
        conn.close()
        return
    else:
        file_content = response.read()
        conn.close()
        return file_content
The headers string should look like this:
GET /fr/backend/XXXXXXXXX/845080 HTTP/1.1
Cookie: cookie_law_consented=true; landing_page=0; _ga=GA1.2.1218703015.1546948765; _gid=GA1.2.580320014.1546948765; _jt=1.735724042.1546948764; SID=5c485bfa-3f2c-425e-a2dd-32dd800e0bb3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: br, gzip, deflate
Host: XXXXX
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.2 Safari/605.1.15
Accept-Language: fr-fr
Referer: XXXXX
Connection: keep-alive
It may change depending on the website, but using these headers allowed me to download files behind a login.
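To sanity-check the parsing step offline, here is a minimal, self-contained sketch that applies the same two regexes to a shortened version of the sample request above (the host and cookie values are placeholders, not real values from the site):

```python
import re

# Raw request as copied from the browser's developer tools
# (shortened; host and cookies are placeholders).
raw = (
    "GET /fr/backend/XXXXXXXXX/845080 HTTP/1.1\n"
    "Cookie: cookie_law_consented=true; landing_page=0\n"
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\n"
    "Host: example.com\n"
    "Connection: keep-alive\n"
    "\n"
)

# Same regexes as in parse_headers above
req_line = re.compile(r'(?P<method>GET|POST)\s+(?P<resource>.+?)\s+(?P<version>HTTP/1.1)')
field_line = re.compile(r'\s*(?P<key>.+\S)\s*:\s+(?P<value>.+\S)\s*')

first_line_end = raw.find('\n')
headers_end = raw.find('\n\n')

# The request line becomes a dict, the header fields a dict of key/value pairs
request = req_line.match(raw[:first_line_end]).groupdict()
headers = dict(field_line.findall(raw[first_line_end:headers_end]))

print(request['method'])    # GET
print(headers['Host'])      # example.com
```

The resulting `headers` dict can then be passed straight to `get_file` (or to `http.client`'s `conn.request`).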
Upvotes: 0
Reputation: 8225
This website checks whether the request carries a Referer from their own site and returns a 403 response otherwise. You can easily bypass this by setting a Referer header.
import requests

ref = 'https://tmofans.com'
headers = {'Referer': ref}
r = requests.get('https://tmofans.com/goto/347231', headers=headers)
print(r.url)
print(r.status_code)
Output
https://tmofans.com/viewer/5c187dcea0240/paginated
200
Upvotes: 2