Reputation: 21
I want to download all PDF files from the website "https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml". I tried many things with wget, such as:

wget --wait 10 --random-wait --continue https://journals.ametsoc.org/downloadpdf/view/journals/mwre/131/5/1520-0493_2003_131_*.co_2.pdf

but I get this message:

Warning: wildcards not supported in HTTP.
--2024-03-29 23:01:27--  https://journals.ametsoc.org/downloadpdf/view/journals/mwre/131/5/1520-0493_2003_131_*.co_2.pdf
Resolving journals.ametsoc.org (journals.ametsoc.org)... 54.73.220.207, 52.208.161.60
Connecting to journals.ametsoc.org (journals.ametsoc.org)|54.73.220.207|:443... connected.
HTTP request sent, awaiting response... 500
2024-03-29 23:01:28 ERROR 500: (no description).
Is there any way to do that using wget, python or any tool? Thank you in advance.
Upvotes: 0
Views: 130
Reputation: 16
As far as I can see, you want to scrape an HTML page, so this won't work like a file manager with wildcards. You need to use either the BeautifulSoup or lxml library from Python. The following code uses the lxml library and should do what you want. It saves the PDFs to the folder where the code is executed:
import requests
from lxml import html

# Pretend to be a normal browser; the site may block default client user agents
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0'
}

url = "https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml"
response = requests.get(url, headers=headers)
page = html.fromstring(response.text)

# Collect the per-article links from the issue's table of contents
url_list = page.xpath("//h1/a[@class='c-Button--link']/@href")

for url in url_list:
    # Turn the article page path into the matching PDF download path
    url_half = url.replace('.xml', '.pdf')
    url_base = "https://journals.ametsoc.org/downloadpdf"
    url_pdf = url_base + url_half
    filename = url_half.split('/')[-1]
    response = requests.get(url_pdf, headers=headers)
    # startswith() also accepts headers like 'application/pdf;charset=UTF-8'
    if response.headers.get('content-type', '').startswith('application/pdf'):
        # Write the content to a PDF file
        with open(filename, 'wb') as file:
            file.write(response.content)
        print(f"PDF file {filename} downloaded successfully!")
    else:
        print(f"The response for {filename} does not contain a PDF file.")
Upvotes: 0
Reputation: 11832
You do not need Python for simple cases; just use the system's own tools. The Unix philosophy (DOS followed it too, though Windows CMD.exe does it better) is to write reusable blocks of commands and adapt them to a specific case. You have to write any set of commands to suit your target, so only parts of the code need to be specific while the rest stays generic.
Thus all we need is an HTML "get and edit" script, which can be Write Once Re-use Many (WORM).
Here I have paused at the first stage of the run, where pass1.htm offers each PDF as a link, but the script can also fetch all those files in a second phase. pass1.htm allows manual download of selected files one by one; you can bypass that step by simply not including that call.
GET.CMD (to be used by any other .BAT file)
@echo off
rem No second argument: show usage
if [%2]==[] goto usage
rem First argument "file$": download every URL listed in the file named by %2
if /i [%1]==[file$] goto getfile$
rem Four arguments: find and replace %2 with %3 in file %1, writing to %4
if not [%4]==[] goto editlines
:getdata
rem Fetch the page at %2 and keep only the lines containing the string %1
curl -o scrape.txt "%~2"
type scrape.txt |find "%~1" >listurls.htm & exit /b
:editlines
rem PowerShell find and replace (gc = Get-Content, sc = Set-Content)
powershell -Command "(gc '%~1') -replace '%~2', '%~3' | sc '%~4'"
exit /b
:getfile$
rem Download each listed URL; lines starting with ; are skipped
for /F "eol=;" %%f in (%~2) do curl -O %%f
pause & exit /b
:usage
echo %~n0 string URL
echo e.g. %~n0 ".pdf" https://example.com/file.htm
pause
Phase1.bat
call get "2.xml" https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml
type listurls.htm & pause
call get listurls.htm "/abstract" "https://journals.ametsoc.org/downloadpdf/view" pass1.txt
call get pass1.txt ".xml" ".pdf" pass2.txt
call get pass2.txt ">" ">a pdf</a></br>" pass1.htm
pass1.htm & pause
notepad pass1.htm
del pass?.txt
call get file$ filelist.txt
For phase 2 we need to continue the find-and-replace output to convert pass1.htm into a filelist.txt, then run the curl loop over that list (the file$ call shown above). You can do that in any text editor such as Notepad (the call is shown above), as for a single specific case it is far quicker to edit in the native system than to write another six lines of code. The advantage is that you can exclude some files and adjust for any phase-1 errors.
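If you would rather script that phase-2 conversion than edit it by hand, a minimal Python sketch could do it; this is only an illustration and assumes pass1.htm contains the rewritten downloadpdf URLs produced by Phase1.bat:

import re

# Pull every downloadpdf URL out of pass1.htm (assumed output of Phase1.bat)
with open('pass1.htm') as f:
    text = f.read()

urls = re.findall(r'https://journals\.ametsoc\.org/downloadpdf\S+?\.pdf', text)

# One URL per line, ready for the curl loop shown below
with open('filelist.txt', 'w') as f:
    f.write('\n'.join(urls))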
The Windows way to download all files in a list is:
for /F "eol=;" %f in (filelist.txt) do curl -O %f
or in a batch file
for /F "eol=;" %%f in (filelist.txt) do curl -O %%f
Upvotes: 0