user308827

Reputation: 22031

Download all files with a given extension from a page

I am trying to download all NetCDF (.nc) files listed here: https://www.ncei.noaa.gov/data/avhrr-land-normalized-difference-vegetation-index/access/2000/

import urllib3
from bs4 import BeautifulSoup

site = urllib3.PoolManager()
base_url = 'https://www.ncei.noaa.gov//data//avhrr-land-normalized-difference-vegetation-index//access//'
html = site.request('GET', base_url + '//' + '2000')
soup = BeautifulSoup(html.data, "lxml")
list_urls = soup.find_all('.nc')

However, list_urls is empty after running this code. How can I fix it?

Upvotes: 0

Views: 293

Answers (1)

Amjad Hussain Syed

Reputation: 1040

The problem is that `soup.find_all('.nc')` searches for tags *named* `.nc`, which don't exist, so it returns an empty list. In that directory listing each filename appears as the link text, so filter on the text instead: `soup.find_all(text=lambda t: ".nc" in t)`. Here is what I did, working fine with a progress bar as well :)

import sys

import requests
import urllib3
import humanize
from bs4 import BeautifulSoup

site = urllib3.PoolManager()
base_url = 'https://www.ncei.noaa.gov/data/avhrr-land-normalized-difference-vegetation-index/access/'
html = site.request('GET', base_url + '2000')
soup = BeautifulSoup(html.data, "lxml")

# The directory listing shows each filename as the link text,
# so filter text nodes that contain ".nc".
link_urls = soup.find_all(text=lambda t: ".nc" in t)

for link in link_urls:
    download_link = "{}2000/{}".format(base_url, link)
    r = requests.get(download_link, stream=True)
    total_length = r.headers.get('content-length')
    print("\nDownloading: {}".format(download_link))
    with open(link, "wb") as f:
        if total_length is None:  # no Content-Length header; write in one go
            f.write(r.content)
        else:
            total_length = int(total_length)
            print("Total size: {}".format(humanize.naturalsize(total_length)))
            dl = 0
            for data in r.iter_content(chunk_size=4096):
                dl += len(data)
                f.write(data)
                # Draw a 50-character progress bar.
                done = int(50 * dl / total_length)
                sys.stdout.write("\r[%s%s]" % ('=' * done, ' ' * (50 - done)))
                sys.stdout.flush()
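
As a design note, matching on the visible link text works here because this directory listing displays each filename verbatim. If you would rather match on the link targets themselves, filtering `href` attributes is a slightly more robust alternative; a minimal sketch, assuming the same `soup` object and that the listing uses plain `<a href="...nc">` entries:

# Alternative selection: filter on the href attribute rather than the link text.
nc_links = [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].endswith('.nc')]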

Upvotes: 1
