dswebber7373

Reputation: 55

bulk download using python with requests

I've been trying to download all the files on this page (https://apps.fs.usda.gov/fia/datamart/datamart_excel.html) in bulk, but I'm having some issues.

All the filenames follow the pattern '{state abbreviation}.xlsm', so I can download a single file with requests using code like this:

import requests
url = 'https://apps.fs.usda.gov/fia/datamart/Workbooks/WA.xlsm'
r = requests.get(url)
with open('WA.xlsm', 'wb') as f:
    f.write(r.content)

I believe there should be a way to incorporate this into a for loop to get all of the files, but I'm at a loss. Any advice?

Thanks!

Upvotes: 0

Views: 1101

Answers (2)

Wizard.Ritvik

Reputation: 11612

Just to add on to @balderman's answer: if you have multiple states to get, it might be slightly more efficient to use a threading approach. Here is a straightforward example using concurrent.futures:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from time import time

import requests

states = ['WA', 'CA', 'VA', 'NC'] # TODO add more states

out_dir = Path('temp_files')
out_dir.mkdir(exist_ok=True)


def get_content(state: str) -> bytes:
    url = f'https://apps.fs.usda.gov/fia/datamart/Workbooks/{state}.xlsm'
    r = requests.get(url)
    return r.content


start = time()

# Cap the pool at 10 threads; fewer are used when there are fewer states.
with ThreadPoolExecutor(max_workers=min(10, len(states))) as pool:
    for state, content in zip(states, pool.map(get_content, states)):
        with open(out_dir / f'{state}.xlsm', 'wb') as f:
            f.write(content)

print('Download ThreadExecutor took', time()-start)

# Compare times with below

# start = time()
# for state in states:
#     b = get_content(state)
#     with open(out_dir / f'{state}.xlsm', 'wb') as f:
#         f.write(b)
# print('Download took', time()-start)
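
To avoid typing the state list by hand, one option is to read the abbreviations from the links on the datamart page itself. This is only a sketch: it assumes the page links to the workbooks with hrefs ending in '.xlsm', which I have not verified, and the names XlsmLinkParser and extract_states are made up for this example. It uses only the standard library's html.parser.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from typing import List


class XlsmLinkParser(HTMLParser):
    """Collect the file stems (e.g. 'WA') of every <a href="...xlsm"> link."""

    def __init__(self) -> None:
        super().__init__()
        self.stems: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get('href') or ''
        # Drop any query string before looking at the file name.
        path = PurePosixPath(href.split('?', 1)[0])
        if path.suffix.lower() == '.xlsm' and path.stem not in self.stems:
            self.stems.append(path.stem)


def extract_states(html: str) -> List[str]:
    """Return the unique .xlsm file stems found in the HTML, in page order."""
    parser = XlsmLinkParser()
    parser.feed(html)
    return parser.stems


# Possible usage (needs network access, so it is left commented out):
# import requests
# page = requests.get('https://apps.fs.usda.gov/fia/datamart/datamart_excel.html')
# states = extract_states(page.text)
```

If the page builds its links with JavaScript instead of plain anchors, this returns an empty list and the abbreviations have to be listed manually after all.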

Upvotes: 1

balderman

Reputation: 23815

Try the following:

import requests

states = ['WA','CA'] # TODO add more states
for state in states:
    url = f'https://apps.fs.usda.gov/fia/datamart/Workbooks/{state}.xlsm'
    r = requests.get(url)
    with open(f'{state}.xlsm', 'wb') as f:
        f.write(r.content)
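
The loop above writes whatever the server returns, so a missing state would end up as an error page saved under an .xlsm name. A slightly more defensive variant might look like the sketch below. It assumes a failed download should be skipped and reported rather than written to disk; the helper names build_url, fetch_with_requests and download_states, the 60-second timeout, and the injectable fetch parameter are all choices made for this example.

```python
from pathlib import Path
from typing import Callable, Iterable, List

BASE_URL = 'https://apps.fs.usda.gov/fia/datamart/Workbooks'


def build_url(state: str) -> str:
    """Return the workbook URL for a state abbreviation such as 'WA'."""
    return f'{BASE_URL}/{state}.xlsm'


def fetch_with_requests(url: str) -> bytes:
    """Download one URL with requests, raising on HTTP error statuses."""
    import requests  # imported here so the other helpers work without it

    r = requests.get(url, timeout=60)
    r.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return r.content


def download_states(
    states: Iterable[str],
    out_dir: str,
    fetch: Callable[[str], bytes] = fetch_with_requests,
) -> List[str]:
    """Save one .xlsm per state into out_dir; return the states that failed."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    failed = []
    for state in states:
        try:
            content = fetch(build_url(state))
        except Exception as exc:  # broad on purpose: report and keep going
            print(f'Skipping {state}: {exc}')
            failed.append(state)
            continue
        (target / f'{state}.xlsm').write_bytes(content)
    return failed


# Possible usage (needs network access, so it is left commented out):
# failed = download_states(['WA', 'CA'], 'workbooks')
```

Passing the download function in as a parameter keeps the loop easy to try out without hitting the server.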

Upvotes: 1
