Reputation: 327
I am still learning how to scrape websites, and I came across this case related to my work: https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files
It screens sanctioned lists. I am trying to get the three CSV hrefs from the table on that page.
I know I could just import each CSV from its link directly in Python, but I want to improve my scraping knowledge. All the "tr" elements share the same class name, "text-align-left", so I thought I would loop over the main "table" element with class "ms-rteTable-1" and then narrow the results down to the links that end with .csv. But my loop doesn't seem to work. What am I doing wrong? I appreciate all the help.
import requests
from bs4 import BeautifulSoup

source_html = requests.get('https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files').text
soup = BeautifulSoup(source_html, 'lxml')
for i in soup.find_all('table', attrs={'class': 'ms-rteTable-1'}):
    print(i.a['href'])
I am getting only the first file, which is the .zip. How can I get all of them instead?
Upvotes: 1
Views: 85
Reputation: 20062
I'd like to offer a different approach: why not go for all the anchors that have .CSV in their link text?
Here's how:
import requests
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
# keep every anchor whose visible text contains ".CSV"
anchors = BeautifulSoup(requests.get(url).text, 'lxml').find_all(
    lambda t: t.name == 'a' and '.CSV' in t.text
)
print([a['href'] for a in anchors])
Output:
['https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_add.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.csv', 'https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.csv']
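If you'd rather match the href attribute itself instead of the link text, BeautifulSoup's CSS selectors do the same job in one call. A minimal sketch, assuming the file URLs end in lowercase .csv as they do in the output above:

import requests
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# a[href$=".csv"] selects every anchor whose href ends with ".csv"
print([a['href'] for a in soup.select('a[href$=".csv"]')])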
As for your code, you're looping over a single item: find_all returns a list containing just the one table, and i.a only grabs the first anchor inside it. You'd have to loop through all of the table's cells to get every href attribute.
To fix that, try this:
import requests
from bs4 import BeautifulSoup

url = 'https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files'
# grab the one table that holds the download links
soup = BeautifulSoup(requests.get(url).text, 'html.parser').find('table', attrs={'class': 'ms-rteTable-1'})
for item in soup.find_all('th'):
    anchor = item.a
    if anchor:  # some cells contain plain text, not a link
        print(anchor['href'])
Output:
https://www.treasury.gov/ofac/downloads/consolidated/consall.zip
https://www.treasury.gov/ofac/downloads/sanctions/1.0/cons_advanced.xml
https://www.treasury.gov/ofac/downloads/consolidated/consolidated.xml
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.del
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.ff
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.csv
https://www.treasury.gov/ofac/downloads/consolidated/cons_prim.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_add.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_alt.pip
https://www.treasury.gov/ofac/downloads/consolidated/cons_comments.pip
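Since the original goal was just the CSV files, you can filter inside the same loop. A sketch reusing the soup table from the snippet above, assuming the CSV hrefs end in lowercase .csv:

# keep only the anchors whose href ends in .csv
csv_links = [th.a.get('href') for th in soup.find_all('th')
             if th.a and th.a.get('href', '').endswith('.csv')]
print(csv_links)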
Upvotes: 0
Reputation: 11808
Just another way, using an XPath selector:
import lxml.html
import requests

response = requests.get('https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list-data-files', stream=True)
response.raw.decode_content = True  # let urllib3 decompress the response before lxml parses the raw stream
tree = lxml.html.parse(response.raw)
# rows 5 through 7 of the table are the ones holding the CSV links
print(tree.xpath("//tr[5 <= position() and position() <= 7]//a/@href"))
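One caveat: the row positions are hard-coded, so this breaks if the table layout ever changes. XPath 1.0 (which lxml implements) has no ends-with(), but a substring comparison can match the extension instead. A sketch, again assuming lowercase .csv extensions:

# compare the last four characters of @href with ".csv"
csv_hrefs = tree.xpath('//a[substring(@href, string-length(@href) - 3) = ".csv"]/@href')
print(csv_hrefs)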
Upvotes: 1