j riot

Reputation: 544

Python - BeautifulSoup Webscrape

I am trying to scrape a list of URLs from the following website (http://thedataweb.rm.census.gov/ftp/cps_ftp.html), but I have had no luck following the tutorials. Here is one example of the code I have tried:

from bs4 import BeautifulSoup
import urllib2

url         = "http://thedataweb.rm.census.gov/ftp/cps_ftp.html"
page        = urllib2.urlopen(url)
soup        = BeautifulSoup(page.read())
cpsLinks    = soup.findAll(text = 
              "http://thedataweb.rm.census.gov/pub/cps/basic/")

print(cpsLinks)

I am trying to extract links like this one:

http://thedataweb.rm.census.gov/pub/cps/basic/201501-/jan15pub.dat.gz

There are probably around 200 of those links. How can I get them?

Upvotes: 2

Views: 306

Answers (1)

alecxe

Reputation: 473763

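From what I understand, you want to extract the links that follow a specific pattern. Passing text= makes BeautifulSoup compare against the text nodes of the page rather than the href attributes, so your attempt most likely returns an empty list. Instead, BeautifulSoup accepts a compiled regular expression as the value of an attribute filter such as href.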

Let's use the following pattern: pub/cps/basic/\d+\-/\w+\.dat\.gz$. It matches pub/cps/basic/ followed by one or more digits (\d+), a hyphen (\-), a slash, one or more word characters (\w+, meaning letters, digits, or underscores), and .dat.gz at the end of the string ($). Note that . has a special meaning in regular expressions and needs to be escaped with a backslash. The hyphen is only special inside a character class, so escaping it here is optional but harmless.
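The pattern can be checked on its own before involving BeautifulSoup. The following is a minimal sketch using only the standard re module. The first two URLs come from the page's links, the third is a made-up non-matching example, and the names PATTERN, candidates, and matches are chosen here for illustration.

```python
import re

# The pattern from the answer: digits, a hyphen, a slash, a file name ending in .dat.gz
PATTERN = re.compile(r'pub/cps/basic/\d+\-/\w+\.dat\.gz$')

candidates = [
    "http://thedataweb.rm.census.gov/pub/cps/basic/201501-/jan15pub.dat.gz",
    "http://thedataweb.rm.census.gov/pub/cps/basic/200701-/disability.dat.gz",
    # Hypothetical URL with the wrong extension, included only to show a non-match
    "http://thedataweb.rm.census.gov/pub/cps/basic/201501-/jan15pub.txt",
    "http://thedataweb.rm.census.gov/ftp/cps_ftp.html",
]

# search() is used because the pattern is anchored only at the end ($),
# so it has to be allowed to start anywhere inside the URL
matches = [url for url in candidates if PATTERN.search(url)]

for url in matches:
    print(url)
```

Only the two .dat.gz URLs under pub/cps/basic/ pass the filter. Using re.match() instead would find nothing, because match() only tries the pattern at the very start of the string.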

The code:

import re
import urllib2

from bs4 import BeautifulSoup


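url = "http://thedataweb.rm.census.gov/ftp/cps_ftp.html"
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(urllib2.urlopen(url), "html.parser")

# A compiled regex passed as href= keeps only tags whose href matches the pattern
links = soup.find_all(href=re.compile(r'pub/cps/basic/\d+\-/\w+\.dat\.gz$'))

for link in links:
    # link.text is the visible text of the tag, link['href'] is its target URL
    print link.text, link['href']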

Prints:

13,232,040 http://thedataweb.rm.census.gov/pub/cps/basic/201501-/jan15pub.dat.gz
13,204,510 http://thedataweb.rm.census.gov/pub/cps/basic/201401-/dec14pub.dat.gz
13,394,607 http://thedataweb.rm.census.gov/pub/cps/basic/201401-/nov14pub.dat.gz
13,409,743 http://thedataweb.rm.census.gov/pub/cps/basic/201401-/oct14pub.dat.gz
13,208,428 http://thedataweb.rm.census.gov/pub/cps/basic/201401-/sep14pub.dat.gz
...
10,866,849 http://thedataweb.rm.census.gov/pub/cps/basic/199801-/jan99pub.dat.gz
3,172,305 http://thedataweb.rm.census.gov/pub/cps/basic/200701-/disability.dat.gz
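For readers on Python 3, where urllib2 no longer exists, the same approach might look like the sketch below. It parses a small inline HTML string instead of downloading the page so that it runs offline. SAMPLE_HTML is a hypothetical excerpt (the real page's markup is assumed, not copied), and extract_links is a helper name introduced here. To fetch the live page, urllib.request.urlopen(url) takes the place of urllib2.urlopen(url).

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; only the hrefs and link
# texts are taken from the output shown above, the table markup is assumed.
SAMPLE_HTML = """
<table>
  <tr><td><a href="http://thedataweb.rm.census.gov/pub/cps/basic/201501-/jan15pub.dat.gz">13,232,040</a></td></tr>
  <tr><td><a href="http://thedataweb.rm.census.gov/pub/cps/basic/200701-/disability.dat.gz">3,172,305</a></td></tr>
  <tr><td><a href="http://thedataweb.rm.census.gov/ftp/cps_ftp.html">Index</a></td></tr>
</table>
"""


def extract_links(html):
    """Return (link text, href) pairs for <a> tags whose href matches the pattern."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r'pub/cps/basic/\d+\-/\w+\.dat\.gz$')
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=pattern)]


links = extract_links(SAMPLE_HTML)
for text, href in links:
    print(text, href)

# For comparison: filtering on string= (the current name for text=) looks at
# text nodes, not attributes, so the URL prefix from the question matches nothing.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
no_hits = soup.find_all(string="http://thedataweb.rm.census.gov/pub/cps/basic/")
print(len(no_hits))
```

The last two statements illustrate why the attempt in the question comes back empty: the URLs live in href attributes, while a text filter only sees text nodes such as the numbers inside each link.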

Upvotes: 2
