Reputation: 21
I want to download all PDF files from the website "https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml". I tried many things with wget, such as:

wget --wait 10 --random-wait --continue https://journals.ametsoc.org/downloadpdf/view/journals/mwre/131/5/1520-0493_2003_131_*.co_2.pdf

but I get this message:

Warning: wildcards not supported in HTTP.
--2024-03-29 23:01:27--  https://journals.ametsoc.org/downloadpdf/view/journals/mwre/131/5/1520-0493_2003_131_*.co_2.pdf
Resolving journals.ametsoc.org (journals.ametsoc.org)... 54.73.220.207, 52.208.161.60
Connecting to journals.ametsoc.org (journals.ametsoc.org)|54.73.220.207|:443... connected.
HTTP request sent, awaiting response... 500
2024-03-29 23:01:28 ERROR 500: (no description).
Is there any way to do that using wget, python or any tool? Thank you in advance.
Upvotes: 0
Views: 130
Reputation: 16
As far as I can see, you want to scrape an HTML page, so this won't work like a file manager with wildcards. You need to use either the BeautifulSoup or lxml library from Python. The following code uses the lxml library and should do what you want. It saves the PDFs to the folder where the code is executed:
import requests
from lxml import html

# Pretend to be a normal browser; the site may block default client user agents
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0'
}

url = "https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml"
response = requests.get(url, headers=headers)
page = html.fromstring(response.text)

# Collect the per-article links from the issue's table of contents
url_list = page.xpath("//h1/a[@class='c-Button--link']/@href")

for url in url_list:
    # Turn the article page path into the matching PDF download path
    url_half = url.replace('.xml', '.pdf')
    url_base = "https://journals.ametsoc.org/downloadpdf"
    url_pdf = url_base + url_half
    filename = url_half.split('/')[-1]
    response = requests.get(url_pdf, headers=headers)
    # startswith() also accepts headers like 'application/pdf;charset=UTF-8'
    if response.headers.get('content-type', '').startswith('application/pdf'):
        # Write the content to a PDF file
        with open(filename, 'wb') as file:
            file.write(response.content)
        print(f"PDF file {filename} downloaded successfully!")
    else:
        print(f"The response for {filename} does not contain a PDF file.")
Upvotes: 0
Reputation: 11832
You do not need Python for simple cases; just use the system's own tools. The Unix philosophy (DOS followed it too, though Windows CMD.exe does it better) is to write reusable blocks of commands and adapt them to a specific case. You have to write any set of commands to suit your target, so only parts of the code need to be specific while the rest stays generic.
Thus all we need is an HTML "get and edit" script, which can be Write Once Re-use Many (WORM).
Here I have paused at the first stage of the run, where pass1.htm offers each PDF as a link, but the script can also fetch all those files in a second phase. pass1.htm allows manual download of selected files one by one; you can bypass that step by simply not including that call.
GET.CMD (to be used by any other .BAT file)
@echo off
rem No second argument: show usage
if [%2]==[] goto usage
rem First argument "file$": download every URL listed in the file named by %2
if /i [%1]==[file$] goto getfile$
rem Four arguments: find and replace %2 with %3 in file %1, writing to %4
if not [%4]==[] goto editlines
:getdata
rem Fetch the page at %2 and keep only the lines containing the string %1
curl -o scrape.txt "%~2"
type scrape.txt |find "%~1" >listurls.htm & exit /b
:editlines
rem PowerShell find and replace (gc = Get-Content, sc = Set-Content)
powershell -Command "(gc '%~1') -replace '%~2', '%~3' | sc '%~4'"
exit /b
:getfile$
rem Download each listed URL; lines starting with ; are skipped
for /F "eol=;" %%f in (%~2) do curl -O %%f
pause & exit /b
:usage
echo %~n0 string URL
echo e.g. %~n0 ".pdf" https://example.com/file.htm
pause
Phase1.bat
call get "2.xml" https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml
type listurls.htm & pause
call get listurls.htm "/abstract" "https://journals.ametsoc.org/downloadpdf/view" pass1.txt
call get pass1.txt ".xml" ".pdf" pass2.txt
call get pass2.txt ">" ">a pdf</a></br>" pass1.htm
pass1.htm & pause
notepad pass1.htm
del pass?.txt
call get file$ filelist.txt
For phase 2 we need to continue the find-and-replace output to convert pass1.htm into a filelist.txt, then run the curl loop over that list (the file$ call shown above). You can do that in any text editor such as Notepad (the call is shown above), as for a single specific case it is far quicker to edit in the native system than to write another six lines of code. The advantage is that you can exclude some files and adjust for any phase-1 errors.
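If you would rather script that phase-2 conversion than edit it by hand, a minimal Python sketch could do it; this is only an illustration and assumes pass1.htm contains the rewritten downloadpdf URLs produced by Phase1.bat:

import re

# Pull every downloadpdf URL out of pass1.htm (assumed output of Phase1.bat)
with open('pass1.htm') as f:
    text = f.read()

urls = re.findall(r'https://journals\.ametsoc\.org/downloadpdf\S+?\.pdf', text)

# One URL per line, ready for the curl loop shown below
with open('filelist.txt', 'w') as f:
    f.write('\n'.join(urls))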
The Windows way to download all files in a list is:
for /F "eol=;" %f in (filelist.txt) do curl -O %f
or in a batch file
for /F "eol=;" %%f in (filelist.txt) do curl -O %%f
Upvotes: 0