Reputation: 1322
I need to process weather data from this website (https://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.20190814/06/), where each file is around 300 MB. Once I download a file, I only need to read in a subset of it. I think downloading the whole thing is going to be too slow, so I was going to use BeautifulSoup to read the data directly from the website, like this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.20190814/06/gfs.t06z.pgrb2.0p25.f000'
response = requests.get(url)
soup = BeautifulSoup(response.content, features='lxml')
I then planned to use the pygrib library to read in a subset of the resulting .grib (a weather data format) file. However, this also proves to be too slow, taking approximately 5 minutes for something that will need to be done 50 times a day. Is there some faster alternative I am not thinking of?
Upvotes: 1
Views: 340
Reputation: 28370
What you can do is download the matching .idx file, which gives you the byte offsets and sizes within the main file. You can then identify the parts of the file that you need and use the techniques mentioned in the accepted answer to Only download a part of the document using python requests to fetch just those bytes.
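A minimal sketch of that approach, assuming the standard NCEP inventory format (each .idx line reads message_number:start_byte:d=date:variable:level:forecast:); the byte_range helper and the 2 m temperature selection are just illustrative:
import requests

url = 'https://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.20190814/06/gfs.t06z.pgrb2.0p25.f000'

# The .idx inventory lives alongside the GRIB2 file, one line per message.
idx_lines = requests.get(url + '.idx').text.splitlines()

def byte_range(variable, level):
    """Return (start, end) byte offsets of the first matching message.
    end is None for the last message (read to end of file)."""
    for i, line in enumerate(idx_lines):
        fields = line.split(':')
        if fields[3] == variable and fields[4] == level:
            start = int(fields[1])
            # A message ends one byte before the next one starts.
            end = int(idx_lines[i + 1].split(':')[1]) - 1 if i + 1 < len(idx_lines) else None
            return start, end
    raise ValueError('%s at %s not found in index' % (variable, level))

# Fetch only the bytes for, e.g., 2 m temperature.
start, end = byte_range('TMP', '2 m above ground')
headers = {'Range': 'bytes=%d-%s' % (start, '' if end is None else end)}
response = requests.get(url, headers=headers)
response.raise_for_status()  # expect 206 Partial Content

with open('subset.grib2', 'wb') as f:
    f.write(response.content)
This downloads a few hundred kilobytes instead of 300 MB per file.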
You may need to do some additional processing to be able to read it using pygrib; the simplest option may be to download the file header and the parts that you are interested in, then combine them into a single file with padding where you are not interested.
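For what it's worth, each GRIB2 message is self-contained, so a file assembled from the fetched ranges can often be opened by pygrib directly; a sketch, assuming the subset.grib2 written above:
import pygrib

# The bytes fetched above form a valid (if small) GRIB file of their own.
grbs = pygrib.open('subset.grib2')
for grb in grbs:
    print(grb)          # e.g. "1:2 metre temperature:K (instant):..."
    data = grb.values   # decoded values as a numpy array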
BTW, you don't need the BeautifulSoup processing at all! The content attribute of the requests.get response is the data you are after.
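That is, the direct download collapses to something like this:
import requests

url = 'https://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.20190814/06/gfs.t06z.pgrb2.0p25.f000'
response = requests.get(url)

# response.content is already the raw GRIB2 bytes; no HTML parsing involved.
with open('gfs.t06z.pgrb2.0p25.f000', 'wb') as f:
    f.write(response.content)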
From the comments:
For anyone who comes across this in the future, for grib files, here is a working outline of this concept that I found: gist.github.com/blaylockbk/… – P.V.
Upvotes: 1