Sam.H

Reputation: 327

BeautifulSoup4 table scraping when no unique class - Learning

I am still learning how to scrape websites, and I came across this case related to my work. Website: https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files

It screens sanctioned-party lists. I am trying to get the 3 CSV hrefs from the table on that page.

I know I could just read the CSVs from their links directly in Python, but I want to improve my scraping skills. All the "tr" elements share the same class name, "text-align-left". So my plan was to loop over the main "table" element with class "ms-rteTable-1" and then narrow the results down to the hrefs that end with CSV, but my loop doesn't seem to work. What am I doing wrong? I appreciate all the help.

import requests
from bs4 import BeautifulSoup

source_html = requests.get('https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files').text
soup = BeautifulSoup(source_html, 'lxml')

for i in soup.find_all('table', attrs={'class': 'ms-rteTable-1'}):
    print(i.a['href'])

I am only getting the first file, which is a .zip. How can I get the CSV links instead?

Upvotes: 1

Views: 85

Answers (2)

baduker

Reputation: 20062

I'd like to offer a different approach: why not grab all the anchors whose link text contains .CSV?

Here's how:

import requests
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# Keep only anchors whose visible text contains ".CSV"
csv_anchors = soup.find_all(lambda t: t.name == "a" and ".CSV" in t.text)
print([a["href"] for a in csv_anchors])

Output:

['https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_add.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.csv']
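A variant of the same idea is to filter on the href attribute itself rather than the link text, using BeautifulSoup's CSS-selector support (`a[href$=".csv"]` matches anchors whose href ends with ".csv"). The sketch below runs against a small hypothetical HTML sample rather than the live page, so the URLs are placeholders:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure of the Treasury table;
# the example.com URLs are placeholders, not the real download links.
sample_html = """
<table class="ms-rteTable-1">
  <tr><th><a href="https://example.com/consall.zip">ZIP</a></th></tr>
  <tr><th><a href="https://example.com/cons_prim.csv">CSV</a></th></tr>
  <tr><th><a href="https://example.com/cons_add.csv">CSV</a></th></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# a[href$=".csv"] selects anchors whose href attribute ends with ".csv"
csv_links = [a["href"] for a in soup.select('a[href$=".csv"]')]
print(csv_links)
```

Selecting on the attribute is a bit more robust than matching the link text, since it doesn't depend on how the page happens to label each download.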

As for your code: find_all returns a list containing a single table, so your loop runs once, and i.a gives you only the first anchor inside that table. You have to loop through all of the table's cells to collect every href attribute.

To fix that, try this:

import requests
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
table = BeautifulSoup(requests.get(url).text, 'html.parser').find('table', attrs={'class': 'ms-rteTable-1'})

# The download links sit inside the table's <th> cells
for item in table.find_all("th"):
    anchor = item.a
    if anchor:
        print(anchor["href"])

Output:

https://www.treasury.gov/ofac/downloads/consolidated/consall.zip
https://www.treasury.gov/ofac/downloads/sanctions/1.0/cons_advanced.xml
https://www.treasury.gov/ofac/downloads/consolidated/consolidated.xml
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.pip

Upvotes: 0

madzohan

Reputation: 11808

Just another way, using an XPath selector:

import lxml.html
import requests

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
response = requests.get(url, stream=True)
response.raw.decode_content = True  # let lxml read the decompressed stream
tree = lxml.html.parse(response.raw)
# Rows 5 through 7 of the table hold the CSV links
print(tree.xpath("//tr[5 <= position() and position() <= 7]//a/@href"))
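If you'd rather not hard-code the row positions, you can filter on the href ending directly. XPath 1.0 (which lxml implements) has no ends-with() function, but substring() can emulate it. A sketch against a hypothetical snippet rather than the live page, with placeholder URLs:

```python
import lxml.html

# Hypothetical snippet standing in for the live page; example.com
# URLs are placeholders, not the real download links.
sample_html = """
<table class="ms-rteTable-1">
  <tr><th><a href="https://example.com/consall.zip">ZIP</a></th></tr>
  <tr><th><a href="https://example.com/cons_prim.csv">CSV</a></th></tr>
</table>
"""

tree = lxml.html.fromstring(sample_html)
# substring(@href, string-length(@href) - 3) yields the last four
# characters of the href, emulating ends-with(@href, ".csv")
hrefs = tree.xpath(
    '//a[substring(@href, string-length(@href) - 3) = ".csv"]/@href'
)
print(hrefs)
```

This way the query keeps working even if the page gains or loses rows above the CSV entries.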

Upvotes: 1
