Reputation: 77
I have an idx file: https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx
I could open the idx file with the following code a year ago, but the code no longer works. Why is that? How should I modify the code?
import requests
import urllib
from bs4 import BeautifulSoup
master_data = []
file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx"
byte_data = requests.get(file_url).content
data_format = byte_data.decode('utf-8').split('------')
content = data_format[-1]
data_list = content.replace('\n','|').split('|')
for index, item in enumerate(data_list):
    if '.txt' in item:
        if data_list[index - 2] == '10-K':
            entry_list = data_list[index - 4: index + 1]
            entry_list[4] = "https://www.sec.gov/Archives/" + entry_list[4]
            master_data.append(entry_list)
print(master_data)
Upvotes: 1
Views: 611
Reputation: 1148
If you inspect the contents of the byte_data variable, you will find that it does not contain the actual content of the idx file. The server returns a different response instead, which is there to block scraping bots like yours. You can find more information in this answer: Problem HTTP error 403 in Python 3 Web Scraping
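A quick way to confirm this before parsing is to check whether the response body looks like an idx file at all. Below is a minimal sketch. The helper name looks_like_idx is made up for illustration, and it assumes only what the question's code already relies on: a real idx file contains a dashed separator line followed by pipe-delimited rows.

```python
def looks_like_idx(text):
    """Heuristic: a dashed separator followed by at least one five-field row."""
    # Split at the first run of dashes; everything after it should be data rows.
    _head, separator, body = text.partition('------')
    if not separator:
        return False
    # A data row has five pipe-delimited fields, i.e. exactly four '|' characters.
    return any(line.count('|') == 4 for line in body.splitlines())
```

Calling it on the decoded byte_data, and printing the status_code of the response next to it, should show whether the server sent the index or something else.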
In this case, the fix is simply to send a User-Agent header with the request.
import requests
master_data = []
file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2020/QTR4/master.20201231.idx"
byte_data = requests.get(file_url, allow_redirects=True, headers={"User-Agent": "XYZ/3.0"}).content
# Your further processing here
On a side note, your processing produces no entries, because the if condition is never met for any of the lines. An empty result therefore does not mean that this solution failed.
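If the goal is to collect the rows for one form type, parsing line by line may be easier to debug than splitting the whole file on '|'. Below is a minimal sketch, not a definitive implementation. The function name filter_idx_entries and the sample rows are made up for illustration, and it assumes each data row has five pipe-delimited fields with the form type third and a .txt path last, as the index arithmetic in the question implies.

```python
def filter_idx_entries(text, form_type="10-K",
                       base_url="https://www.sec.gov/Archives/"):
    """Return [cik, company, form, date, url] rows whose form type matches."""
    entries = []
    for line in text.splitlines():
        fields = line.strip().split('|')
        # Skip the header, the dashed separator, and anything that is not a data row.
        if len(fields) != 5 or not fields[4].endswith('.txt'):
            continue
        if fields[2] == form_type:
            fields[4] = base_url + fields[4]
            entries.append(fields)
    return entries


# Hypothetical sample rows, invented for illustration (not real filings).
sample = (
    "CIK|Company Name|Form Type|Date Filed|File Name\n"
    "--------------------\n"
    "1111|EXAMPLE CORP|10-K|20201231|edgar/data/1111/0000000000-20-000001.txt\n"
    "2222|SAMPLE INC|8-K|20201231|edgar/data/2222/0000000000-20-000002.txt\n"
)
print(filter_idx_entries(sample))
```

With the real file, you would pass byte_data.decode('utf-8') in place of sample. Printing a few raw lines first helps to confirm that the form type appears exactly as you expect.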
Upvotes: 0