Rsync

Reputation: 123

How to Use Python to Iterate Through a Basic Website, Create a List of URLs, and Then Print the Text of Each

I would like to use Python to scrape all links on the Civil Procedure URL of the Montana Code Annotated, as well as all pages linked from that page, and eventually capture the substantive text at the last link. The problem is that the base URL links to Chapters, which in turn link to Parts, and the Part URLs link to the text I want. So this is a "three deep" URL structure with a naming convention that does not use sequential endings like 1, 2, 3, 4, etc.
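For concreteness, the hierarchy looks roughly like this. The part-level index URL comes from my second script below; the title- and chapter-level index pages follow the same directory pattern, but I have elided their exact filenames since I haven't pinned them down:

# Illustrative only -- the exact index filenames at the title and chapter
# levels are elided; the part-level index appears in my second script
title_url   = "https://leg.mt.gov/bills/mca/title_0250/..."               # lists chapters
chapter_url = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/..."  # lists parts
part_url    = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/sections_index.html"  # lists sections
section_url = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html"  # the text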

I am new to Python, so I broke this down into steps.

FIRST, I used this to extract the text from a single URL with substantive text (i.e., three levels deep):

import requests
from bs4 import BeautifulSoup

url = 'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# The div with class "mca-content mca-toc" holds the section's text
href_elem = soup.find('div', class_='mca-content mca-toc')

# "with" closes the file automatically, so no explicit f.close() is needed
with open("Rsync_Test.txt", "w") as f:
    print(href_elem.text, "PAGE_END", file=f)
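One caveat I noticed: soup.find() returns None when no matching div exists, so on a page without that element .text raises an AttributeError. A guard avoids that:

# find() returns None when there is no match; check before using .text
if href_elem is not None:
    with open("Rsync_Test.txt", "w") as f:
        print(href_elem.text, "PAGE_END", file=f)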

SECOND, I created a list of URLs and exported it to a .txt file:

import os
from bs4 import BeautifulSoup
import urllib.request

html_page = urllib.request.urlopen("http://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/sections_index.html")
soup = BeautifulSoup(html_page, "html.parser")
url_base = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/"

# [2:] strips the first two characters of each relative href
# so the remainder can be appended to url_base
for link in soup.find_all('a'):
    print(url_base + link.get('href')[2:])

os.chdir("/home/rsync/Downloads/")

# "with" closes the file automatically, so no explicit f.close() is needed
with open("All_URLs.txt", "w") as f:
    for link in soup.find_all('a'):
        print(url_base + link.get('href')[2:], file=f)
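As an aside, the [2:] slice assumes every href starts with the same two-character relative prefix. urllib.parse.urljoin resolves relative hrefs against the page's own URL, which is less fragile (a small sketch reusing the same soup):

from urllib.parse import urljoin

page_url = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/sections_index.html"

# urljoin resolves each relative href against the index page's URL,
# so no manual prefix-stripping is needed
for link in soup.find_all('a'):
    href = link.get('href')
    if href:
        print(urljoin(page_url, href))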

THIRD, I tried to scrape the text from the resulting URL list:

import os
import requests
from bs4 import BeautifulSoup

url_lst = [    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0020/0250-0190-0010-0020.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0030/0250-0190-0010-0030.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0040/0250-0190-0010-0040.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0050/0250-0190-0010-0050.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0060/0250-0190-0010-0060.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0070/0250-0190-0010-0070.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0080/0250-0190-0010-0080.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0090/0250-0190-0010-0090.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0100/0250-0190-0010-0100.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0110/0250-0190-0010-0110.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0120/0250-0190-0010-0120.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0130/0250-0190-0010-0130.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0140/0250-0190-0010-0140.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0150/0250-0190-0010-0150.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0160/0250-0190-0010-0160.html'
    ]

for link in url_lst:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    href_elem = soup.find('div', class_='mca-content mca-toc')
    
    for link in url_lst:
        with open("Rsync_Test.txt", "w") as f:
            print(href_elem.text,"PAGE_END", file = f)
    f.close()

My plan was to put it all together into a single script (after figuring out how to extract the URLs that sit three levels below the base URL). But the third script iterates over the list within itself and overwrites the output file on every pass instead of printing a separate page for each URL, so I end up with just the text from the last URL.

Any tips on how to fix the third script so it scrapes and prints the text from all 16 URLs produced by the second script would be welcome! As would ideas on how to "pull this together" into something less convoluted.
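For reference, here is the kind of combined, three-level crawl I have in mind. The get_links helper and the title-level index filename are assumptions (I only know the part-level sections_index.html for certain), so this is a sketch of the pattern rather than a tested crawler:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_links(index_url):
    """Return absolute URLs for every <a> tag on an index page."""
    page = requests.get(index_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    return [urljoin(index_url, a.get('href'))
            for a in soup.find_all('a') if a.get('href')]

# The title-level index filename here is an assumption -- adjust to the real one
title_index = 'https://leg.mt.gov/bills/mca/title_0250/chapters_index.html'

with open('Title_25_Text.txt', 'w', encoding='utf-8') as f:
    for chapter_url in get_links(title_index):       # level 1: chapters
        for part_url in get_links(chapter_url):      # level 2: parts
            for section_url in get_links(part_url):  # level 3: sections
                page = requests.get(section_url)
                soup = BeautifulSoup(page.content, 'html.parser')
                div = soup.find('div', class_='mca-content mca-toc')
                if div is not None:
                    print(div.text, "PAGE_END", file=f)

In practice the index pages probably contain navigation links too, so each get_links result would need filtering down to just the chapter/part/section hrefs.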

Upvotes: 2

Views: 410

Answers (1)

Joe Thor

Reputation: 1260

You are iterating through url_lst twice.

Assuming you want the text of each page written to a file: remove the duplicated inner for loop, save the results into a list (scraped_data below), and then write that list to the file in its own for loop.

import requests
from bs4 import BeautifulSoup

url_lst = [    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0020/0250-0190-0010-0020.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0030/0250-0190-0010-0030.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0040/0250-0190-0010-0040.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0050/0250-0190-0010-0050.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0060/0250-0190-0010-0060.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0070/0250-0190-0010-0070.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0080/0250-0190-0010-0080.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0090/0250-0190-0010-0090.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0100/0250-0190-0010-0100.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0110/0250-0190-0010-0110.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0120/0250-0190-0010-0120.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0130/0250-0190-0010-0130.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0140/0250-0190-0010-0140.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0150/0250-0190-0010-0150.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0160/0250-0190-0010-0160.html'
    ]

scraped_data = []

for link in url_lst:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')

    # The div with class "mca-content mca-toc" holds the section's text
    href_elem = soup.find('div', class_='mca-content mca-toc')

    scraped_data.append(href_elem.text)

# Open the output file once and write every scraped page to it
with open('output.txt', 'w', encoding='utf-8') as my_file:
    for text in scraped_data:
        my_file.write(text)
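If you also want the PAGE_END separator from your first script between sections, a small variation on the writing loop (same scraped_data list as above) would do it:

with open('output.txt', 'w', encoding='utf-8') as my_file:
    for text in scraped_data:
        # mark where each section ends, as in the original single-page script
        my_file.write(text + "\nPAGE_END\n")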

This will output a file that starts like this:

Montana Code Annotated 2019

TITLE 25. CIVIL PROCEDURE

CHAPTER 19. UNIFORM DISTRICT COURT RULES

Part 1. Rules

Form Of Papers Presented For Filing

Rule 1 - Form of Papers Presented for Filing.

(a) Papers Defined. The word "papers" as used in this Rule includes all documents and copies except exhibits and records on appeal from lower courts.

(b) Pagination, Printing, Etc. All papers shall be:

(1) Typewritten, printed or equivalent;

(2) Clear and permanent;

(3) Equally legible to printing;

(4) Of type not smaller than pica;

(5) Only on standard quality opaque, unglazed, recycled paper, 8 1/2" x 11" in size.

(6) Printed one side only, except copies of briefs may be printed on both sides. The original brief shall be printed on one side.

(7) Lines unnumbered or numbered consecutively from the top;

(8) Spaced one and one-half or double;

(9) Page numbered consecutively at the bottom; and

(10) Bound firmly at the top. Matters such as property descriptions or direct quotes may be single spaced. Extraneous documents not in the above format and not readily conformable may be filed in their original form and length.

(c) Format. The first page of all papers shall conform to the following:

And so on, through Rule 16 in the data.

Upvotes: 2
