joshijos

Reputation: 19

Downloading all 10-K filings from SEC EDGAR in Python

I have currently managed to scrape all filings for a specific ticker, e.g. 'AAPL', and every type of filing with its link is presented in a massive dictionary. I would like to keep only those links where 'type' == '10-K' and download all of those files as HTML files. I have tried looping over the dictionary and appending to a list, but I am still getting all the filing types.

from urllib.request import urlopen
import certifi
import json

response = urlopen("https://financialmodelingprep.com/api/v3/sec_filings/AMZN?page=0&apikey=aa478b6f376879bc58349bd2a6f9d5eb", cafile=certifi.where())
data = response.read().decode("utf-8")
print (json.loads(data))
list = []

for p_id in data:
    if p_id['type'] == '10-K':
        list.append((p_id['finalLink']))

print(list)

#print(get_jsonparsed_data(url))

The result of this code is shown below, where every filing type is output when only 10-K is needed:

{'symbol': 'AMZN', 'fillingDate': '2014-01-31 00:00:00', 'acceptedDate': '2014-01-30 21:52:38', 'cik': '0001018724', 'type': '10-K', 'link': 'https://www.sec.gov/Archives/edgar/data/1018724/000101872414000006/0001018724-14-000006-index.htm', 'finalLink': 'https://www.sec.gov/Archives/edgar/data/1018724/000101872414000006/amzn-20131231x10k.htm'}, {'symbol': 'AMZN', 'fillingDate': '2014-01-31 00:00:00', 'acceptedDate': '2014-01-30 21:49:36', 'cik': '0001018724', 'type': 'SC 13G/A', 'link': 'https://www.sec.gov/Archives/edgar/data/1018724/000119312514029210/0001193125-14-029210-index.htm', 'finalLink': 'https://www.sec.gov/Archives/edgar/data/1018724/000119312514029210/d659830dsc13ga.htm'}, {'symbol': 'AMZN', 'fillingDate': '2014-01-30 00:00:00', 'acceptedDate': '2014-01-30 16:20:30', 'cik': '0001018724', 'type': '8-K', 'link': 

Once the links are appended to the list, I would ideally like to download all of them at once and save them in a folder. I have previously used the sec_edgar_downloader package, however it downloads the 10-K files into their respective yearly folders.

Upvotes: 1

Views: 11042

Answers (2)

John F

Reputation: 317

You can now use the open-source datamule package to bulk download every 10-K since 2001. It takes about 2.5 minutes per year.

from datamule import Downloader
downloader = Downloader()

downloader.download_dataset("10k_2020") # dataset name follows the pattern "10k_{year}"

Bulk downloads are often not up to date (at the time of writing, the last update was 9/30/24). You can download the remainder using:

downloader.download(form='10-K', date=('2024-09-29', '2024-10-13'))

The bulk datasets are also available on Zenodo.

Disclaimer: I am the developer of the package.

Upvotes: 0

Jay

Reputation: 2039

Instead of filtering the list of all SEC filings on the client side in your Python code, you can filter them directly on the server side. Considering your final objective is to download thousands of 10-K filings filed over many years, and not just Apple's 10-Ks, filtering on the server side saves you a lot of time.

Just FYI, there are other 10-K form variants, namely 10-KT, 10KSB, 10KT405, 10KSB40 and 10-K405. I'm not sure whether you are aware of them and want to ignore them, or whether you would also like to download those variants.

Let's run through a full-fledged 10-K filing downloader implementation. Our application will be structured into two components:

  1. The first component of our application finds all URLs of 10-K filings filed on EDGAR between 2010 and 2022. You can adjust the time horizon to your needs. We also consider other 10-K variants, that is 10-KT, 10KSB, 10KT405, 10KSB40 and 10-K405, as well as all amended filings, for example 10-K/A. Once we have generated the complete list of URLs, we save it to a file on disk.
  2. The second component reads the URLs from the file, and downloads/saves all filings. We download up to 20 filings in parallel using the Render API of the SEC-API package and use Python’s multiprocessing package to speed up the download process.

1. Generate the list of 10-K URLs

The Query API is a search interface allowing us to search and find SEC filings across the entire EDGAR database by any filing meta data parameter. For example, we can find all 10-K filings filed by Apple using a ticker and form type search (formType:"10-K" AND ticker:AAPL) or build more complex search expressions using boolean and brackets operators.

The Query API returns the meta data of SEC filings matching the search query, including the URLs to the filings themselves.

The response of the Query API is a dictionary (dict) with two keys: total and filings. The value of total is itself a dict and tells us, among other things, how many filings match our search query in total. The value of filings is a list of dicts, where each dict holds all the metadata of one matching filing.

The URL of a 10-K filing is the value of the linkToFilingDetails key in each filing dict, for example: https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm

To generate a complete list of 10-K URLs, we simply iterate over all filing dicts, read the linkToFilingDetails value and write each URL to a local file.
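
As a quick illustration of the response structure, here is a minimal sketch (using the same sec_api QueryApi and a placeholder API key) that fetches Apple's most recent 10-K filings and prints their URLs:

from sec_api import QueryApi

queryApi = QueryApi(api_key="YOUR_API_KEY")

# minimal example query: Apple's most recent 10-K filings
query = {
  "query": {
      "query_string": {
          "query": "formType:\"10-K\" AND ticker:AAPL"
      }
  },
  "from": "0",
  "size": "10",
  "sort": [{ "filedAt": { "order": "desc" } }]
}

response = queryApi.get_filings(query)

print(response["total"]) # dict describing how many filings match the query

# each filing dict exposes its URL under the "linkToFilingDetails" key
for filing in response["filings"]:
  print(filing["linkToFilingDetails"])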

Be aware that it takes some time to download and save all URLs. Plan at least 30 minutes for running your application without interruption.

The URL downloader appends a new URL to the log file filing_urls.txt on each processing iteration. In case you accidentally shut down your application, you can start off from the most recently processed year without having to download already processed URLs again.

Uncomment the two lines below in the code (for year in range(2022, 2009, -1): and for from_batch in range(0, 9800, 200):) if you want to generate all URLs at once. I deliberately commented them out to provide a quickly running example of the entire code without having to wait 30+ minutes to see results.

from sec_api import QueryApi

queryApi = QueryApi(api_key="YOUR_API_KEY")

"""
On each search request, the PLACEHOLDER in the base_query is replaced 
with our form type filter and with a date range filter.
"""
base_query = {
  "query": { 
      "query_string": { 
          "query": "PLACEHOLDER", # this will be set during runtime 
          "time_zone": "America/New_York"
      } 
  },
  "from": "0",
  "size": "200", # dont change this
  # sort returned filings by the filedAt key/value
  "sort": [{ "filedAt": { "order": "desc" } }]
}

# open the file we use to store the filing URLs
log_file = open("filing_urls.txt", "a")

# start with filings filed in 2022, then 2021, 2020, ... back to 2010
# uncomment next line to fetch all filings filed from 2022-2010
# for year in range(2022, 2009, -1):
for year in range(2022, 2020, -1):
  print("Starting download for year {year}".format(year=year))
  
  # a single search universe is represented as a month of the given year
  for month in range(1, 13, 1):
    # get 10-K filings (and their variants) filed in the given year and month
    # resulting query example: "formType:(\"10-K\", ...) AND filedAt:[2021-01-01 TO 2021-01-31]"
    universe_query = \
        "formType:(\"10-K\", \"10-KT\", \"10KSB\", \"10KT405\", \"10KSB40\", \"10-K405\") AND " + \
        "filedAt:[{year}-{month:02d}-01 TO {year}-{month:02d}-31]" \
        .format(year=year, month=month)
  
    # set new query universe for year-month combination
    base_query["query"]["query_string"]["query"] = universe_query

    # paginate through results by increasing "from" parameter 
    # until we don't find any matches anymore
    # uncomment next line to fetch all 10,000 filings
    # for from_batch in range(0, 9800, 200): 
    for from_batch in range(0, 400, 200):
      # set new "from" starting position of search 
      base_query["from"] = from_batch

      response = queryApi.get_filings(base_query)

      # no more filings in search universe
      if len(response["filings"]) == 0:
        break

      # for each filing, only save the URL pointing to the filing itself 
      # and ignore all other data. 
      # the URL is set in the dict key "linkToFilingDetails"
      urls_list = list(map(lambda x: x["linkToFilingDetails"], response["filings"]))

      # transform list of URLs into one string by joining all list elements
      # and add a new-line character between each element.
      urls_string = "\n".join(urls_list) + "\n"
      
      log_file.write(urls_string)

    print("Filing URLs downloaded for {year}-{month:02d}".format(year=year, month=month))

log_file.close()

print("All URLs downloaded")

After running the code, you should see progress output for each downloaded year and month, and the file filing_urls.txt should fill up with filing URLs.

2. Download all 10-Ks from SEC EDGAR

The second component of our filing downloader loads all 10-K URLs from our log file filing_urls.txt into memory, and downloads 20 filings in parallel into the folder filings. All filings are downloaded into the same folder.

We use the Render API interface of the SEC-API Python package to download a filing by providing its URL. The Render API allows us to download up to 40 SEC filings per second in parallel. However, we don't utilize the full bandwidth of the API, because otherwise we would very likely run into memory overflow errors (considering some filings are 400+ MB in size).

The download_filing function downloads the filing from the URL, generates a file name using the last two parts of the URL and saves the downloaded file to the filings folder.

The download_all_filings function is the heart and soul of our application. Here, Python's built-in multiprocessing.Pool allows us to apply a function to a list of values in parallel. This way we can apply the download_filing function to the values of the URLs list in parallel.

For example, setting number_of_processes to 4 results in 4 download_filing functions running in parallel where each function processes one URL. Once a download is completed, multiprocessing.Pool gets the next URL from the URLs list and calls download_filing with the new URL.
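
To illustrate the pattern in isolation, here is a toy sketch (unrelated to the SEC code) showing how pool.map applies a function to every element of a list across worker processes:

import multiprocessing

def square(x):
  # stand-in for real work; in our downloader this is download_filing(url)
  return x * x

if __name__ == "__main__":
  with multiprocessing.Pool(4) as pool:
    print(pool.map(square, [1, 2, 3, 4, 5])) # [1, 4, 9, 16, 25]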

We use only the first 40 URLs (urls = load_urls()[:40]) to quickly test the code without having to wait hours for the download to complete. Uncomment the line urls = load_urls() in the code to process all URLs.

import os
import multiprocessing
from sec_api import RenderApi

renderApi = RenderApi(api_key="YOUR_API_KEY")

# download filing and save to "filings" folder
def download_filing(url):
  try:
    filing = renderApi.get_filing(url)
    # file_name example: 000156459019027952-msft-10k_20190630.htm
    file_name = url.split("/")[-2] + "-" + url.split("/")[-1] 
    download_to = "./filings/" + file_name
    with open(download_to, "w") as f:
      f.write(filing)
  except Exception as e:
    print("Problem with {url}".format(url=url))
    print(e)

# load URLs from log file
def load_urls():
  log_file = open("filing_urls.txt", "r")
  # convert the file content into a list of URLs, dropping empty lines
  urls = [url for url in log_file.read().split("\n") if url]
  log_file.close()
  return urls

def download_all_filings():
  print("Start downloading all filings")

  download_folder = "./filings" 
  if not os.path.isdir(download_folder):
    os.makedirs(download_folder)
    
  # uncomment next line to process all URLs
  # urls = load_urls()
  urls = load_urls()[:40] # only the first 40 URLs for a quick test run
  print("{length} filing URLs loaded".format(length=len(urls)))

  number_of_processes = 20

  with multiprocessing.Pool(number_of_processes) as pool:
    pool.map(download_filing, urls)
  
  print("All filings downloaded")

Finally, run download_all_filings() to start downloading all 10-K filings. Your filings folder should fill up with the downloaded 10-K documents.

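Because the downloader uses multiprocessing, the call should sit behind an entry-point guard when you run the script (required on platforms that spawn worker processes, such as Windows and macOS); a minimal sketch:

# guard the entry point so worker processes don't re-run the download on import
if __name__ == "__main__":
  download_all_filings()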

Upvotes: 3
