Reputation: 19
I have managed to scrape all filings for a specific ticker, e.g. 'AAPL', and every type of filing with its link is returned in a large dictionary. I would like to keep only the links where 'type' is '10-K' and download those files as HTML. I have tried looping over the dictionary and appending to a list, but I am still getting all the filing types.
from urllib.request import urlopen
import certifi
import json

response = urlopen("https://financialmodelingprep.com/api/v3/sec_filings/AMZN?page=0&apikey=aa478b6f376879bc58349bd2a6f9d5eb", cafile=certifi.where())
data = response.read().decode("utf-8")
print(json.loads(data))

list = []
for p_id in data:
    if p_id['type'] == '10-K':
        list.append((p_id['finalLink']))
print(list)

#print(get_jsonparsed_data(url))
The output of this code is shown below; every filing type is included when only 10-K is needed:
{'symbol': 'AMZN', 'fillingDate': '2014-01-31 00:00:00', 'acceptedDate': '2014-01-30 21:52:38', 'cik': '0001018724', 'type': '10-K', 'link': 'https://www.sec.gov/Archives/edgar/data/1018724/000101872414000006/0001018724-14-000006-index.htm', 'finalLink': 'https://www.sec.gov/Archives/edgar/data/1018724/000101872414000006/amzn-20131231x10k.htm'},
{'symbol': 'AMZN', 'fillingDate': '2014-01-31 00:00:00', 'acceptedDate': '2014-01-30 21:49:36', 'cik': '0001018724', 'type': 'SC 13G/A', 'link': 'https://www.sec.gov/Archives/edgar/data/1018724/000119312514029210/0001193125-14-029210-index.htm', 'finalLink': 'https://www.sec.gov/Archives/edgar/data/1018724/000119312514029210/d659830dsc13ga.htm'},
{'symbol': 'AMZN', 'fillingDate': '2014-01-30 00:00:00', 'acceptedDate': '2014-01-30 16:20:30', 'cik': '0001018724', 'type': '8-K', 'link':
Once the links are in the list, I would ideally like to download all of them at once and save them in a single folder. I have previously used the sec_edgar_downloader package, but it downloads the 10-K files into separate yearly folders.
Upvotes: 1
Views: 11042
Reputation: 317
You can now use the open-source datamule package to bulk download every 10-K since 2001. It takes about 2.5 minutes per year.
from datamule import Downloader

downloader = Downloader()
downloader.download_dataset("10k_2020")  # dataset name pattern: "10k_{year}"
Bulk downloads are often not fully up to date (at the time of writing, the last update was 9/30/24). You can download the remainder using:
downloader.download(form='10-K', date=('2024-09-29', '2024-10-13'))
The bulk datasets are also available on Zenodo.
Disclaimer: I am the developer of the package.
Upvotes: 0
Reputation: 2039
Instead of filtering the list of all SEC filings on the client side in your Python code, you can filter them directly on the server side. Considering that your final objective is to download thousands of 10-K filings filed over many years, and not just Apple's 10-Ks, filtering on the server side saves you a lot of time.
Just FYI: there are other 10-K form variants, e.g. 10-KT, 10KSB, 10KT405, 10KSB40 and 10-K405. I'm not sure whether you are aware of them and want to ignore them, or whether you want to download those variants as well.
Let's run through a full-fledged 10-K filing downloader implementation. Our application will be structured into two components: the first component finds the URLs of all 10-K filings with the Query API and writes them to a file, and the second component downloads the filings behind those URLs.
The Query API is a search interface that allows us to search and find SEC filings across the entire EDGAR database by any filing meta data parameter. For example, we can find all 10-K filings filed by Apple using a ticker and form type search (formType:"10-K" AND ticker:AAPL), or build more complex search expressions using boolean and bracket operators.
The Query API returns the meta data of the SEC filings matching the search query, including the URLs of the filings themselves.
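To make this concrete, a single request for Apple's 10-K filings could look like the sketch below. It assumes the sec_api package is installed and YOUR_API_KEY is replaced with a real key; the full downloader further down uses the same query structure.

from sec_api import QueryApi

queryApi = QueryApi(api_key="YOUR_API_KEY")

# search by ticker and form type, most recent filings first
query = {
    "query": {
        "query_string": {
            "query": "formType:\"10-K\" AND ticker:AAPL"
        }
    },
    "from": "0",
    "size": "200",
    "sort": [{ "filedAt": { "order": "desc" } }]
}

response = queryApi.get_filings(query)
print(response["total"])
print(response["filings"][0]["linkToFilingDetails"])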
The response of the Query API represents a dictionary (short: dict) with two keys: total and filings. The value of total is a dict itself and tells us, among other things, how many filings in total match our search query. The value of filings is a list of dicts, where each dict represents all meta data of a matching filing.
The URL of a 10-K filing is the value of the linkToFilingDetails key in each filing dict, for example:
https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm
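Put together, the structure of a response looks roughly like this. This is a simplified sketch: real filing dicts contain many more meta data fields, and the exact layout of the total dict may differ.

response = {
    "total": { "value": 128 },  # assumed shape: a dict containing the total match count
    "filings": [
        {
            "formType": "10-K",
            "linkToFilingDetails": "https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm"
            # ... more meta data fields per filing ...
        }
        # ... more filing dicts ...
    ]
}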
To generate a complete list of 10-K URLs, we simply iterate over all filing dicts, read the linkToFilingDetails value and write each URL to a local file.
Be aware that it takes some time to download and save all URLs. Plan at least 30 minutes for running your application without interruption.
The URL downloader appends new URLs to the log file filing_urls.txt on each processing iteration. If you accidentally shut down your application, you can resume from the most recently processed year without having to download already processed URLs again.
Uncomment the two lines below in your code if you want to generate all URLs at once. I deliberately commented them out and used shorter ranges instead, to provide a quick running example of the entire code without having to wait 30+ minutes for results.
for year in range(2022, 2009, -1):
for from_batch in range(0, 9800, 200):
from sec_api import QueryApi

queryApi = QueryApi(api_key="YOUR_API_KEY")

"""
On each search request, the PLACEHOLDER in the base_query is replaced
with our form type filter and with a date range filter.
"""
base_query = {
    "query": {
        "query_string": {
            "query": "PLACEHOLDER",  # this will be set during runtime
            "time_zone": "America/New_York"
        }
    },
    "from": "0",
    "size": "200",  # don't change this
    # sort returned filings by the filedAt key/value
    "sort": [{ "filedAt": { "order": "desc" } }]
}

# open the file we use to store the filing URLs
log_file = open("filing_urls.txt", "a")

# start with filings filed in 2022, then 2021, 2020, ... down to 2010
# uncomment the next line to fetch all filings filed from 2010 to 2022
# for year in range(2022, 2009, -1):
for year in range(2022, 2020, -1):
    print("Starting download for year {year}".format(year=year))

    # a single search universe is represented as a month of the given year
    for month in range(1, 13, 1):
        # get all 10-K filing variants filed in the given year and month
        # resulting query example: formType:("10-K", ...) AND filedAt:[2021-01-01 TO 2021-01-31]
        universe_query = \
            "formType:(\"10-K\", \"10-KT\", \"10KSB\", \"10KT405\", \"10KSB40\", \"10-K405\") AND " + \
            "filedAt:[{year}-{month:02d}-01 TO {year}-{month:02d}-31]" \
            .format(year=year, month=month)

        # set new query universe for the year-month combination
        base_query["query"]["query_string"]["query"] = universe_query

        # paginate through results by increasing the "from" parameter
        # until we don't find any matches anymore
        # uncomment the next line to fetch all 10,000 filings per month
        # for from_batch in range(0, 9800, 200):
        for from_batch in range(0, 400, 200):
            # set new "from" starting position of the search
            base_query["from"] = from_batch

            response = queryApi.get_filings(base_query)

            # no more filings in the search universe
            if len(response["filings"]) == 0:
                break

            # for each filing, only save the URL pointing to the filing itself
            # and ignore all other data.
            # the URL is set in the dict key "linkToFilingDetails"
            urls_list = list(map(lambda x: x["linkToFilingDetails"], response["filings"]))

            # transform the list of URLs into one string by joining all list elements
            # and adding a new-line character between each element
            urls_string = "\n".join(urls_list) + "\n"

            log_file.write(urls_string)

        print("Filing URLs downloaded for {year}-{month:02d}".format(year=year, month=month))

log_file.close()

print("All URLs downloaded")
After running the code, the log file filing_urls.txt should fill up with one filing URL per line.
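For illustration, using the two filing URLs that already appear in this thread, the file contents look like:

https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm
https://www.sec.gov/Archives/edgar/data/1018724/000101872414000006/amzn-20131231x10k.htm
...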
The second component of our filing downloader loads all 10-K URLs from our log file filing_urls.txt into memory and downloads 20 filings in parallel into the folder filings. All filings are downloaded into the same folder.
We use the Render API interface of the SEC-API Python package to download a filing by providing its URL. The Render API allows us to download up to 40 SEC filings per second in parallel. However, we don't utilize the full bandwidth of the API, because otherwise we would very likely run into memory overflow exceptions (some filings are 400+ MB in size).
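As a minimal sketch of the Render API usage on its own (same assumptions as above: the sec_api package is installed and a real API key is used), downloading a single filing looks like this:

from sec_api import RenderApi

renderApi = RenderApi(api_key="YOUR_API_KEY")

# fetch one filing as an HTML string
url = "https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm"
filing_html = renderApi.get_filing(url)
print(len(filing_html))  # size of the downloaded document in characters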
The download_filing function downloads the filing from the URL, generates a file name from the last two parts of the URL, and saves the downloaded file to the filings folder.
The download_all_filings function is the heart and soul of our application. Here, Python's built-in multiprocessing.Pool allows us to apply a function to a list of values in parallel. This way we can apply the download_filing function to the URLs in our list in parallel. For example, setting number_of_processes to 4 results in 4 download_filing calls running in parallel, where each call processes one URL. Once a download is completed, multiprocessing.Pool takes the next URL from the URLs list and calls download_filing with it.
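To illustrate the pattern in isolation, here is a toy example unrelated to the SEC code, with a made-up square function standing in for the per-URL work:

import multiprocessing

def square(n):
    # stand-in for any per-item work, e.g. downloading one URL
    return n * n

if __name__ == "__main__":
    # 4 worker processes; map() applies square() to one item per call, in parallel
    with multiprocessing.Pool(4) as pool:
        results = pool.map(square, [1, 2, 3, 4, 5])
    print(results)  # [1, 4, 9, 16, 25]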
We use only about 40 URLs (urls = load_urls()[1:40]) to quickly test the code without having to wait hours for the download to complete. Uncomment the line urls = load_urls() to process all URLs instead.
import os
import multiprocessing
from sec_api import RenderApi

renderApi = RenderApi(api_key="YOUR_API_KEY")

# download a filing and save it to the "filings" folder
def download_filing(url):
    try:
        filing = renderApi.get_filing(url)
        # file_name example: 000156459019027952-msft-10k_20190630.htm
        file_name = url.split("/")[-2] + "-" + url.split("/")[-1]
        download_to = "./filings/" + file_name
        with open(download_to, "w") as f:
            f.write(filing)
    except Exception as e:
        print("Problem with {url}".format(url=url))
        print(e)

# load URLs from the log file
def load_urls():
    log_file = open("filing_urls.txt", "r")
    urls = log_file.read().split("\n")  # convert long string of URLs into a list
    log_file.close()
    return urls

def download_all_filings():
    print("Start downloading all filings")

    download_folder = "./filings"
    if not os.path.isdir(download_folder):
        os.makedirs(download_folder)

    # uncomment the next line to process all URLs
    # urls = load_urls()
    urls = load_urls()[1:40]
    print("{length} filing URLs loaded".format(length=len(urls)))

    number_of_processes = 20

    with multiprocessing.Pool(number_of_processes) as pool:
        pool.map(download_filing, urls)

    print("All filings downloaded")
Finally, run download_all_filings() to start downloading all 10-K filings. Your filings folder should fill up with the downloaded 10-K filings.
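Depending on your platform, you may need to wrap that call in a __main__ guard, since multiprocessing uses the spawn start method on Windows and macOS and re-imports the script in each worker process:

if __name__ == "__main__":
    download_all_filings()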
Upvotes: 3