Reputation: 123
I would like to use Python to scrape all links on the Civil Procedure page of the Montana Code Annotated, as well as all pages linked from it, and eventually capture the substantive text at the final links. The problem is that the base URL links to Chapters, the Chapter pages link to Parts, and the Part pages link to the text I want. So this is a "three deep" URL structure, and the URL naming convention does not use a simple sequential ending like 1, 2, 3, 4, etc.
I am new to Python, so I broke this down into steps.
FIRST, I used this to extract the text from a single URL with substantive text (i.e., three levels deep):
import requests
from bs4 import BeautifulSoup

url = 'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# The substantive text lives in this div on each section page
href_elem = soup.find('div', class_='mca-content mca-toc')

# The with block closes the file automatically, so no f.close() is needed
with open("Rsync_Test.txt", "w") as f:
    print(href_elem.text, "PAGE_END", file=f)
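One refinement worth noting for this step: requests does not raise an exception on an HTTP error status by itself, so it can quietly hand you a 404 page to parse. Checking the response before parsing avoids that (raise_for_status() is the standard call):

import requests

page = requests.get('https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html')
page.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response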
SECOND, I created a list of URLs and exported it to a .txt file:
import os
import urllib.request
from bs4 import BeautifulSoup

html_page = urllib.request.urlopen("http://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/sections_index.html")
soup = BeautifulSoup(html_page, "html.parser")
url_base = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/"

# Preview the URLs; hrefs on the index look like "./section_0010/...", so drop the leading "./"
for link in soup.findAll('a'):
    print(url_base + link.get('href')[2:])

os.chdir("/home/rsync/Downloads/")
with open("All_URLs.txt", "w") as f:
    for link in soup.findAll('a'):
        print(url_base + link.get('href')[2:], file=f)
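A variation on this step keeps the URLs in a Python list that the scraping step can consume directly, instead of copying them out of the .txt file by hand (same index page and same href trimming as above, collected with a list comprehension):

from bs4 import BeautifulSoup
import urllib.request

index_url = "http://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/sections_index.html"
url_base = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/"

soup = BeautifulSoup(urllib.request.urlopen(index_url), "html.parser")
# hrefs on the index look like "./section_0010/...", so drop the leading "./"
url_lst = [url_base + a.get('href')[2:] for a in soup.find_all('a', href=True)]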
THIRD, I tried to scrape the text from the resulting URL list:
import os
import requests
from bs4 import BeautifulSoup
url_lst = [ 'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0020/0250-0190-0010-0020.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0030/0250-0190-0010-0030.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0040/0250-0190-0010-0040.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0050/0250-0190-0010-0050.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0060/0250-0190-0010-0060.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0070/0250-0190-0010-0070.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0080/0250-0190-0010-0080.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0090/0250-0190-0010-0090.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0100/0250-0190-0010-0100.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0110/0250-0190-0010-0110.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0120/0250-0190-0010-0120.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0130/0250-0190-0010-0130.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0140/0250-0190-0010-0140.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0150/0250-0190-0010-0150.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0160/0250-0190-0010-0160.html'
]
for link in url_lst:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    href_elem = soup.find('div', class_='mca-content mca-toc')
for link in url_lst:
    with open("Rsync_Test.txt", "w") as f:
        print(href_elem.text, "PAGE_END", file=f)
My plan was to put it all together into a single script (after figuring out how to extract URLs that are three levels deep from the base URL). But the third script loops over the URLs without printing a separate page for each one, leaving me with just the text from the last URL.
Any tips on how to fix the third script so it scrapes and prints the text from all 16 of the URLs from the second script would be welcome! As would ideas on how to "pull this together" into something less convoluted.
Upvotes: 2
Views: 410
Reputation: 1260
You are iterating through url_lst twice.
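Two things go wrong at once, which a toy example makes easier to see:

for i in [1, 2, 3]:
    pass
print(i)  # prints 3 -- after a for loop ends, the variable holds only its last value

# The second problem: mode "w" truncates the file on every open, so reopening
# "Rsync_Test.txt" inside the second loop discards everything from earlier passes.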
Assuming you want the text of each page written to a file: remove the duplicated for loop, save the results into a list (scraped_data below), then write that list to the file in its own for loop.
import os
import requests
from bs4 import BeautifulSoup
url_lst = [ 'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0020/0250-0190-0010-0020.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0030/0250-0190-0010-0030.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0040/0250-0190-0010-0040.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0050/0250-0190-0010-0050.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0060/0250-0190-0010-0060.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0070/0250-0190-0010-0070.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0080/0250-0190-0010-0080.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0090/0250-0190-0010-0090.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0100/0250-0190-0010-0100.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0110/0250-0190-0010-0110.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0120/0250-0190-0010-0120.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0130/0250-0190-0010-0130.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0140/0250-0190-0010-0140.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0150/0250-0190-0010-0150.html',
'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0160/0250-0190-0010-0160.html'
]
scraped_data = []
for link in url_lst:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    href_elem = soup.find('div', class_='mca-content mca-toc')
    scraped_data.append(href_elem.text)  # collect each page's text instead of writing immediately

with open('output.txt', 'w', encoding='utf-8') as f:
    for text in scraped_data:
        f.write(text)
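If you also want the PAGE_END marker between sections, as in the question's first script, write it alongside each entry (a small variation on the writing loop above):

with open('output.txt', 'w', encoding='utf-8') as f:
    for text in scraped_data:
        print(text, 'PAGE_END', file=f)  # print() appends the marker and a newline per page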
This will output a file like this:
Montana Code Annotated 2019
TITLE 25. CIVIL PROCEDURE
CHAPTER 19. UNIFORM DISTRICT COURT RULES
Part 1. Rules
Form Of Papers Presented For Filing
Rule 1 - Form of Papers Presented for Filing.
(a) Papers Defined. The word "papers" as used in this Rule includes all documents and copies except exhibits and records on appeal from lower courts.
(b) Pagination, Printing, Etc. All papers shall be:
(1) Typewritten, printed or equivalent;
(2) Clear and permanent;
(3) Equally legible to printing;
(4) Of type not smaller than pica;
(5) Only on standard quality opaque, unglazed, recycled paper, 8 1/2" x 11" in size.
(6) Printed one side only, except copies of briefs may be printed on both sides. The original brief shall be printed on one side.
(7) Lines unnumbered or numbered consecutively from the top;
(8) Spaced one and one-half or double;
(9) Page numbered consecutively at the bottom; and
(10) Bound firmly at the top. Matters such as property descriptions or direct quotes may be single spaced. Extraneous documents not in the above format and not readily conformable may be filed in their original form and length.
(c) Format. The first page of all papers shall conform to the following:
And so on, until Rule 16 in the data.
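As for pulling the whole thing together: below is a minimal sketch of a three-level crawler, assuming the chapter and part index pages link downward the same way the sections index does. The starting URL here is a guess, and in practice you would want to filter out navigation links at each level and perhaps add a polite delay between requests:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical starting point -- substitute the real Title 25 table of contents URL
START = 'https://leg.mt.gov/bills/mca/title_0250/chapters_index.html'

def get_links(url):
    """Return absolute URLs for every <a href> on a page."""
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]

with open('Title_25.txt', 'w', encoding='utf-8') as f:  # open once so nothing is overwritten
    for chapter in get_links(START):          # level 1: chapter pages
        for part in get_links(chapter):       # level 2: part pages
            for section in get_links(part):   # level 3: section pages
                soup = BeautifulSoup(requests.get(section).content, 'html.parser')
                div = soup.find('div', class_='mca-content mca-toc')
                if div is not None:           # skip pages without the content div
                    print(div.text, 'PAGE_END', file=f)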
Upvotes: 2