Omi Slash

Reputation: 147

Python->BeautifulSoup->Webscraping->Drop Down Menu

So I am trying to dump all of the reports from this website:

https://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm

State: All States (Not the Reed Act Benefit or Reed Act Admin)

Report: Transaction Statement

Month: All Months

Year: All Years

Looking at the source code of the website, I can see that the state options are defined like this:

<form action="get" name="UtfReport">
<fieldset>
 <table>
<tr>

    <td>
    <label for="states">State</label><br />
        <select name="states" id="states" size="01">
        <option value="al" selected>Alabama</option>
        <option value="b2">Alabama Reed Act Benefit</option>
        <option value="b3">Alabama Reed Act Admin</option>
        <option value="ak">Alaska</option>
        <option value="a2">Alaska Reed Act Benefit</option> 

So I know that I need to create a list of URL strings like this:

https://www.treasurydirect.gov/govt/reports/tfmp/utf/[a1]/dfiw00[116]tsar.txt

https://www.treasurydirect.gov/govt/reports/tfmp/utf/[a1]/dfiw00[216]tsar.txt

....

So here is my current approach:

import requests, bs4



for i in range(1,13):
    print('https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw0'+str(i).zfill(2),'16tsar.txt')


res = requests.get('https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw00216tsar.txt')

res.raise_for_status()
states = bs4.BeautifulSoup(res.text, 'lxml')

result1.append(res.text)

My effort to create the URL strings ran into a problem as well. This is the output from the code above (there is a space between dfiw00X and 16tsar.txt and I don't know why):

https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw001 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw002 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw003 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw004 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw005 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw006 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw007 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw008 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw009 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw010 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw011 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw012 16tsar.txt

So my question is: there must be a better way of doing this than the way I am currently trying. If anyone can show me how, I would really appreciate it.

Thank you for your time,

Upvotes: 2

Views: 4643

Answers (2)

Padraic Cunningham

Reputation: 180391

You will have to do a little hardcoding. The request URL is put together by the code in utfnav.js; the main part we are interested in is below:

//assembles path to reports

          ReportPath = "/govt/reports/tfmp/utf/"+StateName+"/dfi";
          LinkData = (ReportPath+WeekName+MonthName+YearName+ReportName+StateName+".txt");

          return true;
         }
        }
        else
        {
//displays when dates are not valid for report type selection
         alert ("The requested report is not available at this time.");
         return false;
        }
        }


    function create(form){

    //state selection
    var index;
    index = document.UtfReport.states.selectedIndex;
    StateName = document.UtfReport.states.options[index].value;

    //report selection
    index = document.UtfReport.report.selectedIndex;
    ReportName = document.UtfReport.report.options[index].value;

    //Month selection
    index = document.UtfReport.month.selectedIndex;
    MonthName = document.UtfReport.month.options[index].value;


    //Year selection
    index = document.UtfReport.year.selectedIndex;
    YearName = document.UtfReport.year.options[index].value;



    //Week selection
    WeekName = "w0"; // this is hardcoded even in the JS

So we need to recreate that logic:

import requests
from bs4 import BeautifulSoup


# Mirrors ReportPath + LinkData from utfnav.js (WeekName is always "w0")
temp = "https://www.treasurydirect.gov/govt/reports/tfmp/utf/{state}/dfiw0{mn:0>2}{yr}{rep_name}{state}.txt"
with requests.Session() as s:
    # Fetch the form page; naming the parser explicitly avoids the bs4 "no parser specified" warning
    soup = BeautifulSoup(s.get("https://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm").content, "html.parser")

    # StateName = document.UtfReport.states.options[index].value;
    # Skip the "Reed Act" entries so only the plain state options remain
    states = [opt["value"] for opt in soup.select("#states option") if " Reed " not in opt.text]

    # YearName = document.UtfReport.year.options[index].value;
    available_years = [opt["value"] for opt in soup.select("#year option")]
    # ReportName = document.UtfReport.report.options[index].value;
    report_name = soup.find(id="report").find("option", string="Transaction Statement")["value"]

    for state in states:
        for year in available_years:
            # could do [opt["value"] for opt in soup.select("#month option")]
            # but always 12 months in a year
            for mnth in range(1, 13):
                url = temp.format(state=state, rep_name=report_name, yr=year, mn=mnth)
                print(s.get(url).text)

If you run it you will see output like:

Final Report

                                               Transaction                               Location
    Effective Date                 Shares/Par  Description Code           Memo Number    Code      Account Number
    ---------------  ------------------------  -------------------------  -------------  --------  -------------------------
11-10 STATE DEPOSITS     
    01/04/2016                    17,000.0000  11-10 STATE DEPOSITS        3308616                 AL                       
    01/05/2016                    57,000.0000  11-10 STATE DEPOSITS        3308619                 AL                       
    01/06/2016                   118,000.0000  11-10 STATE DEPOSITS        3308638                 AL                       
    01/07/2016                   129,000.0000  11-10 STATE DEPOSITS        3308657                 AL                       
    01/08/2016                   145,000.0000  11-10 STATE DEPOSITS        3308675                 AL                       
    01/11/2016                   260,000.0000  11-10 STATE DEPOSITS        3308720                 AL                       
    01/12/2016                   566,000.0000  11-10 STATE DEPOSITS        3308743                 AL                       
    01/13/2016                   307,000.0000  11-10 STATE DEPOSITS        3308764                 AL                       
    01/14/2016                   240,000.0000  11-10 STATE DEPOSITS        3308783                 AL                       
    01/15/2016                   340,000.0000  11-10 STATE DEPOSITS        3308802                 AL                       
    01/19/2016                   345,000.0000  11-10 STATE DEPOSITS        3308832                 AL                       
    01/20/2016                   510,000.0000  11-10 STATE DEPOSITS        3308859                 AL                       
    01/21/2016                   533,000.0000  11-10 STATE DEPOSITS        3308889                 AL                       
    01/22/2016                   262,000.0000  11-10 STATE DEPOSITS        3308916                 AL                       
    01/25/2016                   377,000.0000  11-10 STATE DEPOSITS        3308942                 AL                       
    01/26/2016                   778,000.0000  11-10 STATE DEPOSITS        3308968                 AL                       
    01/27/2016                   873,000.0000  11-10 STATE DEPOSITS        3308997                 AL                       
    01/28/2016                   850,000.0000  11-10 STATE DEPOSITS        3309019                 AL                       
    01/29/2016                 1,388,000.0000  11-10 STATE DEPOSITS        3309045                 AL                       
    01/29/2016                    -6,997.0000  11-10 STATE DEPOSITS        3309069       AL        AL                       
                     ------------------------
                               8,088,003.0000

21-10 STATE UI WITHDRAWAL
    01/04/2016                  -183,550.0000  21-10 STATE UI WITHDRAWAL   3308617       AL        AL                       
    01/05/2016                -3,528,550.0000  21-10 STATE UI WITHDRAWAL   3308636       AL        AL                       
    01/06/2016                  -333,800.0000  21-10 STATE UI WITHDRAWAL   3308655       AL        AL                       
    01/07/2016                  -404,700.0000  21-10 STATE UI WITHDRAWAL   3308674       AL        AL                       
    01/08/2016                  -276,600.0000  21-10 STATE UI WITHDRAWAL   3308717       AL        AL                       
    01/11/2016                  -177,600.0000  21-10 STATE UI WITHDRAWAL   3308741       AL        AL                       
    01/12/2016                -3,207,250.0000  21-10 STATE UI WITHDRAWAL   3308760       AL        AL                       
    01/13/2016                  -288,450.0000  21-10 STATE UI WITHDRAWAL   3308781       AL        AL                       
    01/14/2016                  -192,050.0000  21-10 STATE UI WITHDRAWAL   3308800       AL        AL                       
    01/15/2016                  -184,650.0000  21-10 STATE UI WITHDRAWAL   3308825       AL        AL                       
    01/19/2016                -3,115,900.0000  21-10 STATE UI WITHDRAWAL   3308855       AL        AL                       
    01/20/2016                  -343,100.0000  21-10 STATE UI WITHDRAWAL   3308876       AL        AL                       
    01/21/2016                  -187,750.0000  21-10 STATE UI WITHDRAWAL   3308906       AL        AL                       
    01/22/2016                  -135,950.0000  21-10 STATE UI WITHDRAWAL   3308937       AL        AL                       
    01/25/2016                  -136,000.0000  21-10 STATE UI WITHDRAWAL   3308963       AL        AL                       
    01/26/2016                -3,186,100.0000  21-10 STATE UI WITHDRAWAL   3308985       AL        AL                       
    01/27/2016                  -310,500.0000  21-10 STATE UI WITHDRAWAL   3309014       AL        AL                       
    01/28/2016                  -250,500.0000  21-10 STATE UI WITHDRAWAL   3309036       AL        AL                       
    01/29/2016                  -147,300.0000  21-10 STATE UI WITHDRAWAL   3309066       AL        AL                       
                     ------------------------
                             -16,590,300.0000

34-10 BT FROM UI         
    01/22/2016                   -63,394.0000  34-10 BT FROM UI            3308938       AL        AL                       
    01/29/2016                   -19,169.0000  34-10 BT FROM UI            3309067       AL        AL                       
                     ------------------------
                                 -82,563.0000

34-60 CWC OUT            
    01/08/2016                    -2,577.9500  34-60 CWC OUT               3308718       HI        AL                       
    01/12/2016                   -29,354.7300  34-60 CWC OUT               3308761       WY        AL                       
    01/12/2016                    -4,186.2000  34-60 CWC OUT               3308762       NH        AL                       
    01/15/2016                    -7,390.5700  34-60 CWC OUT               3308826       MT        AL                       
    01/15/2016                   -34,003.1200  34-60 CWC OUT               3308827       WV        AL                       
    01/15/2016                    -2,674.2900  34-60 CWC OUT               3308828       RI        AL                       
    01/15/2016                   -12,695.3300  34-60 CWC OUT               3308829       NE        AL                       
    01/15/2016                   -30,307.5600  34-60 CWC OUT               3308830       IN        AL                       
    01/20/2016                  -115,833.7900  34-60 CWC OUT               3308879       VA        AL                       
    01/20/2016                    -6,549.9200  34-60 CWC OUT               3308880       AK        AL                       
    01/20/2016                   -10,316.4900  34-60 CWC OUT               3308881       ME        AL                       
    01/20/2016                   -89,399.3900  34-60 CWC OUT               3308882       CA        AL                       
    01/25/2016                   -10,015.5900  34-60 CWC OUT               3308966       MO        AL                       
    01/26/2016                      -117.6100  34-60 CWC OUT               3308988       VT        AL                       
    01/26/2016                   -17,058.7500  34-60 CWC OUT               3308989       NV        AL                       
    01/26/2016                   -23,359.8400  34-60 CWC OUT               3308990       UT        AL                       
    01/26/2016                   -21,240.3200  34-60 CWC OUT               3308991       OK        AL                       
    01/26/2016                  -110,025.5800  34-60 CWC OUT               3308992       OH        AL                       
    01/26/2016                   -87,745.5400  34-60 CWC OUT               3308993       MN        AL                       
    01/26/2016                    -1,747.0500  34-60 CWC OUT               3308994       DE        AL                       
    01/28/2016                  -439,500.8500  34-60 CWC OUT               3309039       TX        AL                       
    01/28/2016                   -22,375.9600  34-60 CWC OUT               3309040       NC        AL                       
    01/28/2016                   -49,726.7300  34-60 CWC OUT               3309041       MS        AL                       
    01/28/2016                   -54,329.9400  34-60 CWC OUT               3309042       MA        AL                       
    01/28/2016                  -221,805.0100  34-60 CWC OUT               3309043       GA        AL                       
                     ------------------------
                              -1,404,338.1100
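
As for the stray space in the question's output: print separates its arguments with a single space by default, so the comma before '16tsar.txt' is what splits each URL in two. Below is a minimal sketch of one way around it, building each URL as a single string. The helper name month_urls is only for illustration, and the base path and the '16tsar.txt' suffix are copied from the question's own URLs.

```python
# Base path copied from the URLs in the question (state "ar", week "w0").
BASE = "https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw0"


def month_urls(suffix="16tsar.txt"):
    """Return the twelve monthly URLs for one state and year.

    Each URL is formatted as one string, so print() has nothing to
    separate and no space can sneak in.
    """
    return [f"{BASE}{month:02d}{suffix}" for month in range(1, 13)]


urls = month_urls()
print(urls[1])
# https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw00216tsar.txt
```

Passing sep="" to print, or joining the pieces with +, would remove the space as well; formatting the whole URL in one go just keeps the zero padding and the concatenation in the same place.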

Upvotes: 4

Jordan McQueen

Reputation: 787

A good first approach to scraping: Pattern Matching Heuristic

What you want to do, at a high level, is this:

  1. Identify a pattern in the source.
  2. Represent the nature of the pattern in the code.
  3. Scrape according to that code.

I won't code the entire thing here, but I will outline the general approach I would take.

A. Notice that there's a pattern to the way the reports are named. If there is a pattern, then we can assume it is possible to represent it in code.

B. Of primary interest is the last part of the url, '/ar/dfiw00216tsar.txt'.

  1. /ar/ references the state
  2. dfiw0 appears constant, at first glance
  3. 0216 references the date (month 02, year 16)
  4. ts references the type of report, and the trailing ar repeats the state
  5. .txt appears constant, at first glance

From here, we can build a collection of all the possible states and another of all the possible report types, then loop over every combination of state, report type, and date. In each iteration of the for loop we build the url, request it, and save or otherwise process the response as needed.
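
A rough sketch of that iteration, under a few assumptions: the file name layout follows the utfnav.js snippet quoted in the other answer (the state code is repeated at the end), the report code "ts" is inferred from the question's example URL, and the state and year lists here are placeholders rather than the full set from the form. The names TEMPLATE and report_urls are made up for this example. It only builds the URLs; fetching and saving are left out.

```python
from itertools import product

# Layout assumed from utfnav.js: dfi + "w0" + month + year + report + state
TEMPLATE = (
    "https://www.treasurydirect.gov/govt/reports/tfmp/utf/"
    "{state}/dfiw0{month:02d}{year}{report}{state}.txt"
)


def report_urls(states, years, report="ts", months=range(1, 13)):
    """Yield one URL per (state, year, month) combination."""
    for state, year, month in product(states, years, months):
        yield TEMPLATE.format(state=state, month=month, year=year, report=report)


# Placeholder inputs; the real values would come from the form's <option> tags.
urls = list(report_urls(["al", "ak"], ["15", "16"]))
print(len(urls))  # 2 states * 2 years * 12 months = 48
print(urls[0])
# https://www.treasurydirect.gov/govt/reports/tfmp/utf/al/dfiw00115tsal.txt
```

itertools.product flattens what would otherwise be three nested loops, and keeping the generator free of network calls makes it easy to check the URLs before requesting any of them.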

Upvotes: -3
