Pylander

Reputation: 1591

Python Downloading Data File from Web-Scraped URL

I'm trying to develop an automated script to download the following data file to a utility server and then perform ETL-related processing on it. I'm looking for Pythonic suggestions. I'm not familiar with the current best options for this type of process among urllib, urllib2, Beautiful Soup, requests, mechanize, selenium, etc.

The Website

"Full Replacement Monthly NPI File"

The Monthly Data File

The file name (and therefore the URL) changes monthly.

Here is my approach thus far:

from bs4 import BeautifulSoup
import urllib
import urllib2

# Fetch the download page and parse it.
soup = BeautifulSoup(urllib2.urlopen('http://nppes.viva-it.com/NPI_Files.html').read())

# Collect the href of every link on the page.
download_links = []
for link in soup.findAll(href=True):
    urls = link.get('href', '/')
    download_links.append(urls)

# The monthly full-replacement file happens to be the third link.
target_url = download_links[2]

urllib.urlretrieve(target_url, "NPI.zip")

I am not anticipating the content on this clunky govt. site to change, so I thought just selecting the 3rd element of the scraped URL list would be good enough. Of course, if my entire approach is wrongheaded, I welcome correction (data analytics is my personal forte, not web scraping). Also, if I am using outdated libraries, unpythonic practices, or low-performance options, I definitely welcome the newer and better!

Upvotes: 3

Views: 2304

Answers (1)

Roland Smith

Reputation: 43495

In general, requests is the easiest way to get webpages.

If the name of the data file follows the pattern NPPES_Data_Dissemination_<Month>_<year>.zip, which seems logical, you can request it directly:

import requests

# Build the monthly file's URL directly from the month name and year.
url = "http://nppes.viva-it.com/NPPES_Data_Dissemination_{}_{}.zip"
r = requests.get(url.format("March", 2015))

The data is then in r.content (since a zip archive is binary, use r.content rather than r.text).
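If you want to save the archive to disk rather than hold it all in memory, a minimal sketch using requests' streaming download (the chunk size and output filename here are arbitrary choices):

import requests

url = "http://nppes.viva-it.com/NPPES_Data_Dissemination_{}_{}.zip"

# Stream the download so the whole archive is never held in memory.
r = requests.get(url.format("March", 2015), stream=True)
r.raise_for_status()  # fail early on a 404, e.g. if the month name is wrong

with open("NPI.zip", "wb") as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)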

If the data-file name is less certain, you can get the webpage and use a regular expression to search for links to zip files:

In [1]: import requests

In [2]: r = requests.get('http://nppes.viva-it.com/NPI_Files.html')

In [3]: import re

In [4]: re.findall('http.*NPPES.*\.zip', r.text)
Out[4]: 
['http://nppes.viva-it.com/NPPES_Data_Dissemination_March_2015.zip',
 'http://nppes.viva-it.com/NPPES_Deactivated_NPI_Report_031015.zip',
 'http://nppes.viva-it.com/NPPES_Data_Dissemination_030915_031515_Weekly.zip',
 'http://nppes.viva-it.com/NPPES_Data_Dissemination_031615_032215_Weekly.zip',
 'http://nppes.viva-it.com/NPPES_Data_Dissemination_032315_032915_Weekly.zip',
 'http://nppes.viva-it.com/NPPES_Data_Dissemination_033015_040515_Weekly.zip',
 'http://nppes.viva-it.com/NPPES_Data_Dissemination_100614_101214_Weekly.zip']

The regular expression in In[4] basically says to find strings that start with "http", contain "NPPES", and end with ".zip". This isn't specific enough. Let's change the regular expression as shown below:

In [5]: re.findall('http.*NPPES_Data_Dissemination.*\.zip', r.text)
Out[5]: 
['http://nppes.viva-it.com/NPPES_Data_Dissemination_March_2015.zip',
 'http://nppes.viva-it.com/NPPES_Data_Dissemination_030915_031515_Weekly.zip',
 'http://nppes.viva-it.com/NPPES_Data_Dissemination_031615_032215_Weekly.zip',
 'http://nppes.viva-it.com/NPPES_Data_Dissemination_032315_032915_Weekly.zip',
 'http://nppes.viva-it.com/NPPES_Data_Dissemination_033015_040515_Weekly.zip',
 'http://nppes.viva-it.com/NPPES_Data_Dissemination_100614_101214_Weekly.zip']

This gives us the URL of the file we want, but also the weekly files.

In [6]: fileURLS = re.findall('http.*NPPES_Data_Dissemination.*\.zip', r.text)

Let's filter out the weekly files:

In [7]: [f for f in fileURLS if 'Weekly' not in f]
Out[7]: ['http://nppes.viva-it.com/NPPES_Data_Dissemination_March_2015.zip']

This is the URL you seek. But this whole scheme does depend on how regular the names are. You can add flags to the regular-expression searches to ignore the case of the letters (re.IGNORECASE), which would make the match accept more variants.
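For example, a minimal sketch combining the case-insensitive flag with the weekly filter (the variable names are mine; the pattern is the one from above):

import re
import requests

r = requests.get('http://nppes.viva-it.com/NPI_Files.html')

# re.IGNORECASE makes the match case-insensitive, so "nppes" or ".ZIP"
# would also be caught if the site changed its capitalization.
file_urls = re.findall(r'http.*NPPES_Data_Dissemination.*\.zip',
                       r.text, flags=re.IGNORECASE)

# Drop the weekly files, comparing case-insensitively as well.
monthly = [f for f in file_urls if 'weekly' not in f.lower()]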

Upvotes: 4
