Reputation: 65
I'm still quite new to Python and am trying to use it for web scraping.
Specifically, I want to get all the quotes on this page that are labelled "XXX full quotes by YYY", or, when there is only one quote, "the full quote by YYY". After getting the text on each page, I want to save it as a separate text file.
I've been following this tutorial, but I'm a little confused about how to filter the HTML. Honestly, I have barely any experience with HTML, so it's hard for me to navigate, but I think the section of interest is this:
<a href="javascript:pop('../2020/
Here's my code so far, to open the webpage.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

# define the url of interest
my_url = 'http://archive.ontheissues.org/Free_Trade.htm'

# set a known browser User-Agent on the request to avoid an HTTPError
req = Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})

# opening up connection, grabbing the page
uClient = urlopen(req)
page_html = uClient.read()
uClient.close()

# parse the raw html into a navigable BeautifulSoup object
soup = BeautifulSoup(page_html, "html.parser")
Any help is much appreciated.
EDIT:
My idea is to compile the relevant URLs first and store them, then have BeautifulSoup grab the text at each URL. I have managed to isolate all the links of interest:
tags = soup.find_all("a", href=re.compile("javascript:pop"))
print(tags)
for links in tags:
    link = links.get('href')
    if "java" in link:
        print("http://archive.ontheissues.org" + link[18:len(link)-3])
Now, how can I extract the text from each of the individual links?
Upvotes: 1
Views: 1192
Reputation: 33384
Use requests and a regular expression to search for the particular link text, then save that text into a text file:
import requests
from bs4 import BeautifulSoup
import re

URL = 'http://archive.ontheissues.org/Free_Trade.htm'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

file1 = open("Quotefile.txt", "w")
for a in soup.find_all('a', text=re.compile("the full quote by|full quotes by")):
    file1.writelines(a.text.strip() + "\n")
    # print(a.text.strip())
file1.close()
EDITED:
import requests
from bs4 import BeautifulSoup
import re

URL = 'http://archive.ontheissues.org/Free_Trade.htm'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

file1 = open("Quotefile.txt", "w")
for a in soup.find_all('a', href=re.compile("javascript:pop")):
    shref = a['href'].split("'")[1]
    if 'Background_Free_Trade.htm' not in shref:
        link = "http://archive.ontheissues.org" + shref[2:]
        print(link)
        file1.writelines(a.text.strip() + "\n")
file1.close()
EDITED2:
import requests
from bs4 import BeautifulSoup
import re

URL = 'http://archive.ontheissues.org/Free_Trade.htm'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

file1 = open("Quotefile.txt", "w")
for a in soup.find_all('a', href=re.compile("javascript:pop")):
    shref = a['href'].split("'")[1]
    if 'Background_Free_Trade.htm' not in shref:
        link = "http://archive.ontheissues.org" + shref[2:]
        print(link)
        pagex = requests.get(link, headers=headers)
        # use a separate name so we don't overwrite the listing-page soup
        soupx = BeautifulSoup(pagex.content, 'html.parser')
        print(soupx.find('h1').text)
        file1.writelines(soupx.find('h1').text + "\n")
file1.close()
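Since the original goal was one text file per quote page, a per-page filename can be derived from each link instead of writing everything into the shared Quotefile.txt. A minimal sketch (the helper name `filename_from_link` is my own, not from the thread):

```python
import re

def filename_from_link(link):
    """Turn a quote-page URL into a safe .txt filename,
    e.g. .../2020/Some_Candidate_Free_Trade.htm -> Some_Candidate_Free_Trade.txt"""
    stem = link.rstrip("/").rsplit("/", 1)[-1]   # last path segment
    stem = re.sub(r"\.html?$", "", stem)         # drop the .htm/.html suffix
    stem = re.sub(r"[^A-Za-z0-9_-]", "_", stem)  # replace unsafe characters
    return stem + ".txt"

# inside the loop above, instead of the shared Quotefile.txt:
#   with open(filename_from_link(link), "w", encoding="utf-8") as f:
#       f.write(BeautifulSoup(pagex.content, 'html.parser').get_text("\n", strip=True))
```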
Upvotes: 1
Reputation: 463
Is this something like what you want?
soup = BeautifulSoup(page_html, "html.parser")

if __name__ == '__main__':
    for tag in soup.find_all('a'):  # type: Tag
        if 'href' in tag.attrs and tag['href'].startswith("javascript:pop('../2020/"):
            print(tag)
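Building on that filter, the popup target inside those href values can be turned into an absolute URL by splitting on the single quote, much like the other answer does. A sketch assuming the `javascript:pop('../2020/...')` shape shown in the question (the helper name is mine):

```python
def popup_url(href, base="http://archive.ontheissues.org"):
    """Extract the relative path from a javascript:pop('...') href
    and join it onto the site root."""
    inner = href.split("'")[1]       # e.g. ../2020/Some_Page.htm
    return base + inner.lstrip(".")  # leading '..' dropped, '/' kept
```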
Upvotes: 0