Reputation: 3
I'm pulling together a dataset to do analysis on. The goal is to parse a table on an SEC webpage and pull out the link in the row that contains the text "SC 13D". This needs to be repeatable so I can automate it across a large list of links I have in a database. I know this code is not the most Pythonic, but I hacked it together to get what I need out of the table, except for the link in the table row. How can I extract the href value from the table row?
I tried doing a .findAll on 'tr' instead of 'td' in the table (the rows = table.findAll("td") line), but couldn't figure out how to search the rows for "SC 13D" and pull out the matching element the way I do with .findAll('td'). I also tried to get the anchor tag that holds the link using .get('a') instead of .get('href') (both calls are at the end of the code), but that also returns "None".
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'https://www.sec.gov/Archives/edgar/data/1050122/000101143807000336/0001011438-07-000336-index.htm'
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table',{'summary':'Document Format Files'})
rows = table.findAll("td")
i = 0
pos = 0
for row in rows:
    if "SC 13D" in row:
        pos = i
        break
    else: i = i + 1
linkpos = pos - 1
linkelement = rows[linkpos]
print(linkelement.get('a'))
print(linkelement.get('href'))
The expected result is to print the link in linkelement. The actual result is "None".
Upvotes: 0
Views: 2978
Reputation: 2445
That is because your a tag is inside your td tag, so the href attribute belongs to the a element, not to the td.
You just have to do:
linkelement = rows[linkpos]
a_element = linkelement.find('a')
print(a_element.get('href'))
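To see why the original calls print None, here is a minimal sketch on a made-up table cell (the markup and the file path are assumptions for illustration, not the actual SEC page). It shows that .get() reads attributes of the tag it is called on, while .find() searches that tag's descendants:

```python
from bs4 import BeautifulSoup

# Hypothetical cell markup, loosely modelled on a table cell that holds a link.
html = (
    '<table><tr><td scope="row">'
    '<a href="/Archives/example.txt">example.txt</a>'
    '</td></tr></table>'
)
td = BeautifulSoup(html, "html.parser").find("td")

# .get() looks up attributes on the td itself; it has no 'a' or 'href' attribute.
td_get_a = td.get("a")
td_get_href = td.get("href")
td_scope = td.get("scope")  # an attribute the td really has

# .find() searches the td's descendants and returns the <a> Tag (or None).
a_element = td.find("a")
href = a_element.get("href") if a_element is not None else None

print(td_get_a, td_get_href, td_scope, href)
```

The None check on a_element is there because .find() returns None when a cell has no link, which would otherwise raise an AttributeError.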
Upvotes: 1
Reputation: 28595
Switch your .get to .find. You want to find the <a> tag and print its href attribute:
print(linkelement.find('a')['href'])
Or use .get on the tag itself:
print(linkelement.a.get('href'))
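For the row-based approach mentioned in the question (searching 'tr' instead of 'td'), one possible sketch is below. The function name find_filing_link, the sample HTML, and its column layout are assumptions made up for illustration; only the summary="Document Format Files" lookup comes from the question's code, and the real page may be laid out differently.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def find_filing_link(html, form_type, base_url):
    """Return the absolute URL of the first link in a row that has a cell
    whose text equals form_type, or None if no such row or link exists."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"summary": "Document Format Files"})
    if table is None:
        return None
    for row in table.find_all("tr"):
        # Compare against the stripped text of each cell in this row.
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if form_type in cells:
            a = row.find("a", href=True)
            if a is not None:
                # Hrefs may be site-relative, so resolve against base_url.
                return urljoin(base_url, a["href"])
    return None


# Made-up sample table; paths and values are placeholders, not real filings.
sample = """
<table summary="Document Format Files">
<tr><th>Seq</th><th>Description</th><th>Document</th><th>Type</th><th>Size</th></tr>
<tr><td>1</td><td>SCHEDULE 13D</td><td><a href="/Archives/edgar/data/0/form.htm">form.htm</a></td><td>SC 13D</td><td>1000</td></tr>
<tr><td>2</td><td>EXHIBIT</td><td><a href="/Archives/edgar/data/0/ex.htm">ex.htm</a></td><td>EX-99</td><td>500</td></tr>
</table>
"""

link = find_filing_link(sample, "SC 13D", "https://www.sec.gov/")
print(link)
```

Working per row avoids the pos - 1 index arithmetic, so it should keep working if the link and type columns are not adjacent. Returning None for a missing table or row makes it easier to loop over many URLs without one bad page stopping the run.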
Upvotes: 0