Reputation: 147
So I am trying to dump all of the reports from this website:
https://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm
State: All States (Not the Reed Act Benefit or Reed Act Admin)
Report: Transaction Statement
Month: All Months
Year: All Years
Looking at the source code of the website, I can see the values behind the state selections:
<form action="get" name="UtfReport">
<fieldset>
<table>
<tr>
<td>
<label for="states">State</label><br />
<select name="states" id="states" size="01">
<option value="al" selected>Alabama</option>
<option value="b2">Alabama Reed Act Benefit</option>
<option value="b3">Alabama Reed Act Admin</option>
<option value="ak">Alaska</option>
<option value="a2">Alaska Reed Act Benefit</option>
So I know that I need to create a list of URL strings like this:
https://www.treasurydirect.gov/govt/reports/tfmp/utf/[a1]/dfiw00[116]tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/[a1]/dfiw00[216]tsar.txt
....
So here is my current approach:
import requests, bs4
for i in range(1,13):
    print('https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw0'+str(i).zfill(2),'16tsar.txt')
    res = requests.get('https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw00216tsar.txt')
    res.raise_for_status()
    states = bs4.BeautifulSoup(res.text, 'lxml')
    result1.append(res.text)
My attempt at creating the URL strings ran into a problem as well; this is the output from the code above (there is a space between dfiw00X and 16tsar.txt and I don't know why):
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw001 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw002 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw003 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw004 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw005 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw006 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw007 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw008 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw009 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw010 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw011 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw012 16tsar.txt
So my question is: there must be a better way of doing this than what I am currently trying, so if anyone can show me how, I would really appreciate it.
Thank you for your time,
Upvotes: 2
Views: 4643
Reputation: 180391
You will have to do a little hardcoding; the request URL is put together by the code in utfnav.js, and the main part we are interested in is below:
//assembles path to reports
ReportPath = "/govt/reports/tfmp/utf/"+StateName+"/dfi";
LinkData = (ReportPath+WeekName+MonthName+YearName+ReportName+StateName+".txt");
return true;
}
}
else
{
//displays when dates are not valid for report type selection
alert ("The requested report is not available at this time.");
return false;
}
}
function create(form){
    //state selection
    var index;
    index = document.UtfReport.states.selectedIndex;
    StateName = document.UtfReport.states.options[index].value;
    //report selection
    index = document.UtfReport.report.selectedIndex;
    ReportName = document.UtfReport.report.options[index].value;
    //Month selection
    index = document.UtfReport.month.selectedIndex;
    MonthName = document.UtfReport.month.options[index].value;
    //Year selection
    index = document.UtfReport.year.selectedIndex;
    YearName = document.UtfReport.year.options[index].value;
    //Week selection
    WeekName = "w0"; // this is hardcoded even in the JS
So we need to recreate that logic. As a sanity check: with StateName = "ar", WeekName = "w0", MonthName = "02", YearName = "16" and ReportName = "ts" (which looks like the Transaction Statement code, going by the question's example URL), that assembles into /govt/reports/tfmp/utf/ar/dfiw00216tsar.txt, which is exactly the URL from the question. In Python:
import requests
from bs4 import BeautifulSoup

# ReportPath + LinkData from utfnav.js, rebuilt as a single format string
temp = "https://www.treasurydirect.gov/govt/reports/tfmp/utf/{state}/dfiw0{mn:0>2}{yr}{rep_name}{state}.txt"

with requests.Session() as s:
    soup = BeautifulSoup(s.get("https://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm").content, "lxml")
    # StateName = document.UtfReport.states.options[index].value;
    states = [opt["value"] for opt in soup.select("#states option") if " Reed " not in opt.text]
    # YearName = document.UtfReport.year.options[index].value;
    available_years = [opt["value"] for opt in soup.select("#year option")]
    # ReportName = document.UtfReport.report.options[index].value;
    report_name = soup.find(id="report").find("option", text="Transaction Statement")["value"]
    for state in states:
        for year in available_years:
            # could do [opt["value"] for opt in soup.select("#month option")]
            # but there are always 12 months in a year
            for mnth in range(1, 13):
                url = temp.format(state=state, rep_name=report_name, yr=year, mn=mnth)
                print(s.get(url).text)
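As a side note on the question's stray space: print() separates its arguments with a space by default, which is where the gap between dfiw00X and 16tsar.txt came from; passing sep="" (or building a single string, as the format template above does) avoids it:
print('dfiw0' + str(2).zfill(2), '16tsar.txt')          # dfiw002 16tsar.txt  (default sep=" ")
print('dfiw0' + str(2).zfill(2), '16tsar.txt', sep='')  # dfiw00216tsar.txt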
If you run the full script you will see output like:
Final Report
Transaction Location
Effective Date Shares/Par Description Code Memo Number Code Account Number
--------------- ------------------------ ------------------------- ------------- -------- -------------------------
11-10 STATE DEPOSITS
01/04/2016 17,000.0000 11-10 STATE DEPOSITS 3308616 AL
01/05/2016 57,000.0000 11-10 STATE DEPOSITS 3308619 AL
01/06/2016 118,000.0000 11-10 STATE DEPOSITS 3308638 AL
01/07/2016 129,000.0000 11-10 STATE DEPOSITS 3308657 AL
01/08/2016 145,000.0000 11-10 STATE DEPOSITS 3308675 AL
01/11/2016 260,000.0000 11-10 STATE DEPOSITS 3308720 AL
01/12/2016 566,000.0000 11-10 STATE DEPOSITS 3308743 AL
01/13/2016 307,000.0000 11-10 STATE DEPOSITS 3308764 AL
01/14/2016 240,000.0000 11-10 STATE DEPOSITS 3308783 AL
01/15/2016 340,000.0000 11-10 STATE DEPOSITS 3308802 AL
01/19/2016 345,000.0000 11-10 STATE DEPOSITS 3308832 AL
01/20/2016 510,000.0000 11-10 STATE DEPOSITS 3308859 AL
01/21/2016 533,000.0000 11-10 STATE DEPOSITS 3308889 AL
01/22/2016 262,000.0000 11-10 STATE DEPOSITS 3308916 AL
01/25/2016 377,000.0000 11-10 STATE DEPOSITS 3308942 AL
01/26/2016 778,000.0000 11-10 STATE DEPOSITS 3308968 AL
01/27/2016 873,000.0000 11-10 STATE DEPOSITS 3308997 AL
01/28/2016 850,000.0000 11-10 STATE DEPOSITS 3309019 AL
01/29/2016 1,388,000.0000 11-10 STATE DEPOSITS 3309045 AL
01/29/2016 -6,997.0000 11-10 STATE DEPOSITS 3309069 AL AL
------------------------
8,088,003.0000
21-10 STATE UI WITHDRAWAL
01/04/2016 -183,550.0000 21-10 STATE UI WITHDRAWAL 3308617 AL AL
01/05/2016 -3,528,550.0000 21-10 STATE UI WITHDRAWAL 3308636 AL AL
01/06/2016 -333,800.0000 21-10 STATE UI WITHDRAWAL 3308655 AL AL
01/07/2016 -404,700.0000 21-10 STATE UI WITHDRAWAL 3308674 AL AL
01/08/2016 -276,600.0000 21-10 STATE UI WITHDRAWAL 3308717 AL AL
01/11/2016 -177,600.0000 21-10 STATE UI WITHDRAWAL 3308741 AL AL
01/12/2016 -3,207,250.0000 21-10 STATE UI WITHDRAWAL 3308760 AL AL
01/13/2016 -288,450.0000 21-10 STATE UI WITHDRAWAL 3308781 AL AL
01/14/2016 -192,050.0000 21-10 STATE UI WITHDRAWAL 3308800 AL AL
01/15/2016 -184,650.0000 21-10 STATE UI WITHDRAWAL 3308825 AL AL
01/19/2016 -3,115,900.0000 21-10 STATE UI WITHDRAWAL 3308855 AL AL
01/20/2016 -343,100.0000 21-10 STATE UI WITHDRAWAL 3308876 AL AL
01/21/2016 -187,750.0000 21-10 STATE UI WITHDRAWAL 3308906 AL AL
01/22/2016 -135,950.0000 21-10 STATE UI WITHDRAWAL 3308937 AL AL
01/25/2016 -136,000.0000 21-10 STATE UI WITHDRAWAL 3308963 AL AL
01/26/2016 -3,186,100.0000 21-10 STATE UI WITHDRAWAL 3308985 AL AL
01/27/2016 -310,500.0000 21-10 STATE UI WITHDRAWAL 3309014 AL AL
01/28/2016 -250,500.0000 21-10 STATE UI WITHDRAWAL 3309036 AL AL
01/29/2016 -147,300.0000 21-10 STATE UI WITHDRAWAL 3309066 AL AL
------------------------
-16,590,300.0000
34-10 BT FROM UI
01/22/2016 -63,394.0000 34-10 BT FROM UI 3308938 AL AL
01/29/2016 -19,169.0000 34-10 BT FROM UI 3309067 AL AL
------------------------
-82,563.0000
34-60 CWC OUT
01/08/2016 -2,577.9500 34-60 CWC OUT 3308718 HI AL
01/12/2016 -29,354.7300 34-60 CWC OUT 3308761 WY AL
01/12/2016 -4,186.2000 34-60 CWC OUT 3308762 NH AL
01/15/2016 -7,390.5700 34-60 CWC OUT 3308826 MT AL
01/15/2016 -34,003.1200 34-60 CWC OUT 3308827 WV AL
01/15/2016 -2,674.2900 34-60 CWC OUT 3308828 RI AL
01/15/2016 -12,695.3300 34-60 CWC OUT 3308829 NE AL
01/15/2016 -30,307.5600 34-60 CWC OUT 3308830 IN AL
01/20/2016 -115,833.7900 34-60 CWC OUT 3308879 VA AL
01/20/2016 -6,549.9200 34-60 CWC OUT 3308880 AK AL
01/20/2016 -10,316.4900 34-60 CWC OUT 3308881 ME AL
01/20/2016 -89,399.3900 34-60 CWC OUT 3308882 CA AL
01/25/2016 -10,015.5900 34-60 CWC OUT 3308966 MO AL
01/26/2016 -117.6100 34-60 CWC OUT 3308988 VT AL
01/26/2016 -17,058.7500 34-60 CWC OUT 3308989 NV AL
01/26/2016 -23,359.8400 34-60 CWC OUT 3308990 UT AL
01/26/2016 -21,240.3200 34-60 CWC OUT 3308991 OK AL
01/26/2016 -110,025.5800 34-60 CWC OUT 3308992 OH AL
01/26/2016 -87,745.5400 34-60 CWC OUT 3308993 MN AL
01/26/2016 -1,747.0500 34-60 CWC OUT 3308994 DE AL
01/28/2016 -439,500.8500 34-60 CWC OUT 3309039 TX AL
01/28/2016 -22,375.9600 34-60 CWC OUT 3309040 NC AL
01/28/2016 -49,726.7300 34-60 CWC OUT 3309041 MS AL
01/28/2016 -54,329.9400 34-60 CWC OUT 3309042 MA AL
01/28/2016 -221,805.0100 34-60 CWC OUT 3309043 GA AL
------------------------
-1,404,338.1100
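If the goal is to dump every report rather than print it, the print call can be swapped for something that writes each report to disk and skips combinations that do not exist. Here is a minimal sketch, assuming one file per report in a local directory (the save_report helper, the utf_reports folder and the status-code check are my own choices, not anything the site dictates):

import os

def save_report(session, url, out_dir="utf_reports"):
    """Fetch one report and write it to out_dir; skip it if the server does not return it."""
    resp = session.get(url)
    if resp.status_code != 200:
        return None  # no report published for this state/month/year
    os.makedirs(out_dir, exist_ok=True)
    # reuse the last path component, e.g. dfiw00216tsar.txt, as the file name
    path = os.path.join(out_dir, url.rsplit("/", 1)[-1])
    with open(path, "w") as f:
        f.write(resp.text)
    return path

Inside the innermost loop, save_report(s, url) then replaces print(s.get(url).text).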
Upvotes: 4
Reputation: 787
What you want to do, at a high level, is this:
A. Notice that there is a pattern to the way the reports are named. If there is a pattern, it can be represented in code.
B. Of primary interest is the last part of the URL, '/ar/dfiw00216tsar.txt'.
From there, you can build a dictionary of all the possible states and a dictionary of all the possible report types, then iterate through every combination of state, report type and date in a loop, fetching each URL and saving or otherwise processing it as needed.
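For example, a rough sketch of that idea in Python (only the state codes visible in the question's HTML are filled in, "ts" is assumed to be the Transaction Statement code based on the question's example URL, and the year codes are placeholders that should really come from the page's year <select>):

from itertools import product

import requests

# codes as they appear in the form's <select> options on the report page
states = {"al": "Alabama", "ak": "Alaska"}    # ...fill in the remaining states from the page
reports = {"ts": "Transaction Statement"}     # ...add other report types as needed
years = ["15", "16"]                          # two-digit year codes (placeholders)
months = ["{:0>2}".format(m) for m in range(1, 13)]

url_template = ("https://www.treasurydirect.gov/govt/reports/tfmp/utf/"
                "{state}/dfiw0{month}{year}{report}{state}.txt")

for state, report, year, month in product(states, reports, years, months):
    url = url_template.format(state=state, report=report, year=year, month=month)
    response = requests.get(url)
    if response.ok:
        # save or otherwise process response.text here
        print(url)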
Upvotes: -3