Zeinab

Reputation: 21

download all pdf files from website doesn't support wildcard

I want to download all the PDF files from the journal issue page "https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml". I tried many things with wget, such as:

wget --wait 10 --random-wait --continue https://journals.ametsoc.org/downloadpdf/view/journals/mwre/131/5/1520-0493_2003_131_*.co_2.pdf

but I get this output:

Warning: wildcards not supported in HTTP.
--2024-03-29 23:01:27--  https://journals.ametsoc.org/downloadpdf/view/journals/mwre/131/5/1520-0493_2003_131_*.co_2.pdf
Resolving journals.ametsoc.org (journals.ametsoc.org)... 54.73.220.207, 52.208.161.60
Connecting to journals.ametsoc.org (journals.ametsoc.org)|54.73.220.207|:443... connected.
HTTP request sent, awaiting response... 500
2024-03-29 23:01:28 ERROR 500: (no description).

Is there any way to do that using wget, python or any tool? Thank you in advance.

Upvotes: 0

Views: 130

Answers (2)

dnelub

Reputation: 16

As far as I can see, you want to scrape an HTML page, so wildcards won't work the way they do in a file manager. You need to extract the links first, using a Python library such as BeautifulSoup or lxml. The following code uses the lxml library and should do what you want. It saves the PDFs to the folder where the script is executed:

import requests
from lxml import html

headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0'
}

# Fetch the issue's table-of-contents page and parse it.
url = "https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml"
response = requests.get(url, headers=headers)
page = html.fromstring(response.text)

# Each article title links to its .xml landing page.
article_urls = page.xpath("//h1/a[@class='c-Button--link']/@href")

for article_url in article_urls:
    # Build the direct PDF URL from the article path.
    pdf_path = article_url.replace('.xml', '.pdf')
    pdf_url = "https://journals.ametsoc.org/downloadpdf" + pdf_path
    filename = pdf_path.split('/')[-1]
    response = requests.get(pdf_url, headers=headers)
    # startswith() tolerates extra parameters such as a charset suffix.
    if response.headers.get('content-type', '').startswith('application/pdf'):
        # Write the content to a PDF file.
        with open(filename, 'wb') as file:
            file.write(response.content)
        print(f"{filename} downloaded successfully!")
    else:
        print(f"The response for {filename} does not contain a PDF file.")

Upvotes: 0

K J

Reputation: 11832

You do not need Python for simple cases; just use the system's own tools.

The Unix philosophy (DOS followed it too, though Windows CMD.exe is better) is to write reusable blocks of commands and adapt them to a specific case. Any set of commands has to be written to suit your target, so only parts of the code need to be specific while the rest stays generic.

Thus all we need is an HTML "get and edit" routine, which can be Write Once Re-use Many (WORM).

Here I have paused at the end of the first get run, where the output offers each PDF as a link, but the same output can be used to fetch all the files in a second phase. For example, pass1.htm allows manual download of selected files one by one; you can bypass that step by simply not including that call.


GET.CMD (to be called by any other .BAT file):

@echo off
if [%2]==[] goto usage
if /i [%1]==[file$] goto getfile$
if not [%4]==[] goto editlines

:getdata
:: Two arguments: download the page (second arg), keep only lines containing the search string (first arg)
curl -o scrape.txt "%~2"
type scrape.txt |find "%~1" >listurls.htm & exit /b

:editlines
:: Four arguments: in the input file, replace the second arg with the third, writing the result to the fourth
powershell -Command "(gc '%~1') -replace '%~2', '%~3' | sc '%~4'"
exit /b

:getfile$
:: "file$" mode: download every URL listed in the file given as the second arg
for /F "eol=;" %%f in (%~2) do curl -O %%f
pause & exit /b

:usage
echo %~n0 string URL
echo e.g. %~n0 ".pdf" https://example.com/file.htm
pause

Phase1.bat

call get "2.xml" https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml
type listurls.htm & pause
call get listurls.htm "/abstract" "https://journals.ametsoc.org/downloadpdf/view" pass1.txt
call get pass1.txt ".xml" ".pdf" pass2.txt
call get pass2.txt ">" ">a pdf</a></br>" pass1.htm
pass1.htm & pause
notepad pass1.htm
del pass?.txt
call get file$ filelist.txt

For phase 2, we continue the find-and-replace on the output to convert pass1.htm into a filelist.txt, then run curl -O on each entry in filelist.txt.

You can do that edit in any text editor such as Notepad (the call is shown above); for a single specific case it is far quicker to edit in the native system than to write another six lines of code. The advantage is that you can exclude some files and correct any phase-1 errors.

The Windows way to download all files in a list is,

for /F "eol=;" %f in (filelist.txt) do curl -O %f

or in a batch file

for /F "eol=;" %%f in (filelist.txt) do curl -O %%f


Upvotes: 0
