Reputation: 1322
I need to process weather data from this website (https://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.20190814/06/), where each file is around 300 MB. Once I download a file, I only need to read in a subset of it. I think downloading the whole thing is going to be too slow, so I was going to use BeautifulSoup to read the data directly from the website, like this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.20190814/06/gfs.t06z.pgrb2.0p25.f000'
response = requests.get(url)
soup = BeautifulSoup(response.content, features='lxml')
I then planned to use the pygrib library to read in a subset of the resulting .grib (a weather data format) file. However, this also proves to be too slow, taking approximately 5 minutes for something that will need to be done 50 times a day. Is there some faster alternative I am not thinking of?
Upvotes: 1
Views: 340
Reputation: 28370
What you can do is download the matching .idx file, which gives you the byte offsets and sizes within the main file. You can then identify the parts of the file that you need and use the techniques mentioned in the accepted answer to Only download a part of the document using python requests to fetch just those bytes.
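A minimal sketch of that approach, assuming the standard NCEP inventory format (each .idx line reads message_number:start_byte:d=date:variable:level:forecast:); the byte_range helper and the 2 m temperature selection are just illustrative:
import requests

url = 'https://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.20190814/06/gfs.t06z.pgrb2.0p25.f000'

# The .idx inventory lives alongside the GRIB2 file, one line per message.
idx_lines = requests.get(url + '.idx').text.splitlines()

def byte_range(variable, level):
    """Return (start, end) byte offsets of the first matching message.
    end is None for the last message (read to end of file)."""
    for i, line in enumerate(idx_lines):
        fields = line.split(':')
        if fields[3] == variable and fields[4] == level:
            start = int(fields[1])
            # A message ends one byte before the next one starts.
            end = int(idx_lines[i + 1].split(':')[1]) - 1 if i + 1 < len(idx_lines) else None
            return start, end
    raise ValueError('%s at %s not found in index' % (variable, level))

# Fetch only the bytes for, e.g., 2 m temperature.
start, end = byte_range('TMP', '2 m above ground')
headers = {'Range': 'bytes=%d-%s' % (start, '' if end is None else end)}
response = requests.get(url, headers=headers)
response.raise_for_status()  # expect 206 Partial Content

with open('subset.grib2', 'wb') as f:
    f.write(response.content)
This downloads a few hundred kilobytes instead of 300 MB per file.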
You may need to do some additional processing to be able to read it using pygrib; the simplest option may be to download the file header and the parts that you are interested in, then combine them into a single file with padding where you are not interested.
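For what it's worth, each GRIB2 message is self-contained, so a file assembled from the fetched ranges can often be opened by pygrib directly; a sketch, assuming the subset.grib2 written above:
import pygrib

# The bytes fetched above form a valid (if small) GRIB file of their own.
grbs = pygrib.open('subset.grib2')
for grb in grbs:
    print(grb)          # e.g. "1:2 metre temperature:K (instant):..."
    data = grb.values   # decoded values as a numpy array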
BTW, you don't need the BeautifulSoup processing at all! The content attribute of the requests.get response is the data you are after.
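That is, the direct download collapses to something like this:
import requests

url = 'https://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.20190814/06/gfs.t06z.pgrb2.0p25.f000'
response = requests.get(url)

# response.content is already the raw GRIB2 bytes; no HTML parsing involved.
with open('gfs.t06z.pgrb2.0p25.f000', 'wb') as f:
    f.write(response.content)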
From the comments:
For anyone who comes across this in the future, for grib files, here is a working outline of this concept that I found: gist.github.com/blaylockbk/… – P.V.
Upvotes: 1