Reputation: 178
I've been trying to download a webpage's HTML and then search it for a link. However, it does not work: the link somehow changes when I download the page.
This is the webpage I want to download something from using Python: https://b-ok.asia/book/4201067/7cd79d
When I open the page source of that webpage in my browser I can easily see the bit that I need: <a class="btn btn-primary dlButton addDownloadedBook" href="/dl/4201067/b9ffc6" target="" data-book_id="4201067" rel="nofollow">
(I need that /dl/..../.... bit)
However, when I try to use this code to get it using python, it does not work:
import requests
booklink="https://b-ok.asia/book/4201067/7cd79d"
downpage=requests.get(booklink, allow_redirects=True).text
print(downpage)
z=downpage.find("/dl/")
print(downpage[z:z+18])
dllink="https://b-ok.asia"+downpage[z:z+18]
print(dllink)
Here, downpage[z:z+18], which should have been "/dl/4201067/b9ffc6", instead comes out to be "/dl/4201067/89c216". I have absolutely no idea where this new number came from. When I use this, it brings me back to the original page which had the download link.
Can anyone help me out as to how to go about doing this?
Upvotes: 0
Views: 89
Reputation: 9430
I guess you want to download the book. The website changes the URL to prevent people from linking to it directly, presumably by tying it to cookies or session cookies. If you use a session from requests, it keeps your cookies from one request to the next, so you can download the book. The code below saves the book to book.epub in the directory you run the script from.
import requests
import shutil
from bs4 import BeautifulSoup
sess = requests.session()
req = sess.get('https://b-ok.asia/book/4201067/7cd79d')
soup = BeautifulSoup(req.content, 'html.parser')
link = soup.find('a', {'class': 'btn btn-primary dlButton addDownloadedBook'})['href']
with sess.get(f'https://b-ok.asia{link}', stream=True) as req2:
    with open('./book.epub', 'wb') as file:
        shutil.copyfileobj(req2.raw, file)
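The same flow can be wrapped in a function that checks the HTTP status and resolves the relative link with urljoin. This is only a sketch under assumptions: fetch_book, DL_HREF and the session parameter are illustrative names of mine, and it assumes the download link still shows up as an href starting with /dl/. It writes the body with iter_content rather than raw, so that requests handles any content encoding.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: assumes the download link is an href beginning with /dl/
DL_HREF = re.compile(r'href="(/dl/[^"]+)"')


def fetch_book(session, page_url, out_path):
    """Download the book linked from page_url, reusing one session throughout.

    `session` is expected to behave like requests.Session.
    """
    page = session.get(page_url)
    page.raise_for_status()

    match = DL_HREF.search(page.text)
    if match is None:
        raise ValueError("no /dl/ link found on the page")

    # Resolve the relative href against the page URL
    dl_url = urljoin(page_url, match.group(1))

    # Same session, so cookies set by the first response are sent again
    with session.get(dl_url, stream=True) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=8192):
                fh.write(chunk)
    return dl_url
```

It would be called as fetch_book(requests.Session(), 'https://b-ok.asia/book/4201067/7cd79d', 'book.epub') after importing requests.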
Upvotes: 1
Reputation: 9969
To get just the href attribute, you can use .find() to locate the a tag with that class.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://b-ok.asia/book/4201067/7cd79d')
if r.status_code != 200:
    print("Error fetching page")
    exit()
else:
    content = r.content
soup = BeautifulSoup(r.content, 'html.parser')
print(soup)
z=soup.find('a',{'class':'btn btn-primary dlButton addDownloadedBook' })
print(z['href'])
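Matching the exact class string stops working if the site reorders or adds classes on the button. Below is a rough sketch, using only the standard library's html.parser, that accepts any a tag whose class list contains dlButton. The class name comes from the page source quoted in the question; DownloadLinkParser and find_download_href are hypothetical names. With BeautifulSoup, soup.select_one('a.dlButton') expresses the same idea.

```python
from html.parser import HTMLParser


class DownloadLinkParser(HTMLParser):
    """Remember the href of the first <a> whose classes include dlButton."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attrs = dict(attrs)
        # Split the class attribute so that order and extra classes do not matter
        classes = (attrs.get("class") or "").split()
        if "dlButton" in classes:
            self.href = attrs.get("href")


def find_download_href(html):
    """Return the download href found in html, or None if there is none."""
    parser = DownloadLinkParser()
    parser.feed(html)
    return parser.href


# The tag quoted in the question, used here as offline sample input
sample = ('<a class="btn btn-primary dlButton addDownloadedBook" '
          'href="/dl/4201067/b9ffc6" target="" data-book_id="4201067" '
          'rel="nofollow">')
print(find_download_href(sample))  # /dl/4201067/b9ffc6
```

The returned path is relative, so it still has to be joined with https://b-ok.asia and requested through the same session that fetched the page.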
Upvotes: 0