Reputation: 3
I'm trying to extract some data from the SEC database using BeautifulSoup. I'm new to Python, but I've been able to write the following code.
The idea is to read a list of quote symbols from a .txt file and extract the CIK number of each company for further use.
import requests
from bs4 import BeautifulSoup

list_path = r"C:\Users\User1\Downloads\Quote list.txt"

with open(list_path, "r") as flist:
    for quote in flist:
        quote = quote.replace("\n", "")
        url = (r"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" + quote +
               r"&type=10&dateb=&owner=exclude&count=100")
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for company_info in soup.find_all("span", {"class": "companyName"}):
            cik_code = company_info.string
            print(cik_code)
So far, the code above prints None for cik_code. The HTML element is the following:
<span class="companyName dm-selected dm-test">
  AAON INC
  <acronym title="Central Index Key">CIK</acronym>
  #:
  <a href="/cgi-bin/browse-edgar?action=getcompany&CIK=0000824142&owner=exclude&count=100"
     class="">0000824142 (see all company filings)</a>
</span>
The CIK code is the last number, 0000824142, just before "(see all company filings)".
How can I assign that number to the variable cik_code?
Upvotes: 0
Views: 965
Reputation: 1402
I think you just need to go into the <a> tag that's inside the <span> tag.
for company_info in soup.find_all('span', {'class': 'companyName'}):
    cik_code = company_info.find_next('a').text.split(' ', maxsplit=1)[0]
    print(cik_code)
Explanation:
company_info.find_next('a') returns:
<a href="/cgi-bin/browse-edgar?action=getcompany&CIK=0000824142&owner=exclude&count=100" class="">0000824142 (see all company filings)</a>
.text returns:
0000824142 (see all company filings)
.split(' ', maxsplit=1)[0] returns:
0000824142
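To check the steps above without hitting the SEC site, here is a minimal sketch that runs them on the snippet from the question. The `html` string is my copy of that snippet with the href joined onto one line, and `span.find("a")` is a swapped-in alternative to `find_next('a')` that only looks inside the span; both are assumptions for illustration, not something the page guarantees. It also shows why the original `.string` came back as None: the span has several children (text, `<acronym>`, `<a>`), and `.string` only returns a value when there is a single child.

```python
from bs4 import BeautifulSoup

# Copy of the <span> from the question (href joined onto one line; assumption).
html = """
<span class="companyName dm-selected dm-test">
AAON INC
<acronym title="Central Index Key">CIK</acronym>
#:
<a href="/cgi-bin/browse-edgar?action=getcompany&CIK=0000824142&owner=exclude&count=100"
class="">0000824142 (see all company filings)</a>
</span>
"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", {"class": "companyName"})

# .string is None here because the span contains more than one child node.
print(span.string)

# Look for the link inside the span only, then keep the text before the first space.
link = span.find("a")
cik_code = link.text.split(" ", maxsplit=1)[0]
print(cik_code)
```

In the real loop it is probably worth checking that `link` is not None before calling `.text`, since a symbol with no match may return a page without that link.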
Upvotes: 1