Requests making changes while downloading the HTML

Question

I've been trying to download a webpage's HTML and then search it for a link. However, this does not work: the link seems to change when I download the page.

This is the webpage I want to download something from using Python: https://b-ok.asia/book/4201067/7cd79d

When I open the page source of that webpage in my browser, I can easily see the bit that I need: (I need that /dl/..../.... bit)

However, when I try to use this code to get it using python, it does not work:

import requests

booklink = "https://b-ok.asia/book/4201067/7cd79d"
downpage = requests.get(booklink, allow_redirects=True).text
print(downpage)
# Find the download path and take its 18 characters, e.g. "/dl/4201067/b9ffc6"
z = downpage.find("/dl/")
print(downpage[z:z+18])
dllink = "https://b-ok.asia" + downpage[z:z+18]
print(dllink)

Here, downpage[z:z+18], which should have been "/dl/4201067/b9ffc6", instead comes out as "/dl/4201067/89c216". I have absolutely no idea where this new value came from. When I use this link, it takes me back to the original page that had the download link.

Can anyone help me work out how to do this?

Dan-Dev · Accepted Answer

I guess you want to download the book. The website changes the URL to prevent people from linking to it directly, presumably by tying it to cookies or session cookies. If you use a Session from requests, it keeps your cookies from one request to the next, so you can download the book. The code below saves the book as book.epub in the directory you run the script from.

import requests
import shutil
from bs4 import BeautifulSoup

# A Session keeps cookies, so the second request carries those set by the first.
sess = requests.Session()
req = sess.get('https://b-ok.asia/book/4201067/7cd79d')
soup = BeautifulSoup(req.content, 'html.parser')
# The download button's href holds the /dl/... path.
link = soup.find('a', {'class': 'btn btn-primary dlButton addDownloadedBook'})['href']
# Stream the file to disk through the same session.
with sess.get(f'https://b-ok.asia{link}', stream=True) as req2:
    with open('./book.epub', 'wb') as file:
        shutil.copyfileobj(req2.raw, file)
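
As a variation on the same idea, here is a minimal sketch that finds the /dl/ link with a regular expression instead of a fixed 18-character slice, and fetches both the page and the file through whatever session object is passed in. The helper names find_download_url and download_book, the DL_PATTERN regex (digits followed by a hex token, inferred only from the two example links in the question), and the 8192-byte chunk size are assumptions for illustration; the site's actual link format may differ.

```python
import re
from urllib.parse import urljoin

# Assumed link shape: "/dl/<digits>/<hex token>", guessed from the examples
# "/dl/4201067/b9ffc6" and "/dl/4201067/89c216" in the question.
DL_PATTERN = re.compile(r"/dl/\d+/[0-9a-fA-F]+")


def find_download_url(page_html, page_url):
    """Return the absolute download URL found in page_html, or None."""
    match = DL_PATTERN.search(page_html)
    if match is None:
        return None
    # urljoin resolves the site-relative path against the page's own URL.
    return urljoin(page_url, match.group(0))


def download_book(session, page_url, dest_path):
    """Fetch the page and then the file with the same session object.

    `session` is expected to behave like requests.Session, so cookies set
    by the first request are sent with the second one.
    """
    page = session.get(page_url)
    page.raise_for_status()
    file_url = find_download_url(page.text, page_url)
    if file_url is None:
        raise ValueError("no /dl/ link found in the page")
    with session.get(file_url, stream=True) as response:
        response.raise_for_status()
        with open(dest_path, "wb") as out:
            # Write the body chunk by chunk instead of loading it all at once.
            for chunk in response.iter_content(chunk_size=8192):
                out.write(chunk)
    return file_url
```

With requests installed, calling download_book(requests.Session(), "https://b-ok.asia/book/4201067/7cd79d", "book.epub") should behave much like the code above. Passing a plain requests module instead of a Session would lose the cookies between the two calls, which is the situation described in the question.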
