Reputation: 105
I am trying to write a web scraping function that does a few things:
Here is the current code:
# websites.py: the list of URLs to scrape
urls = ['https://calevip.org/incentive-project/northern-california',
        'https://www.slocleanair.org/community/grants/altfuel.php',
        'https://www.mcecleanenergy.org/ev-charging/',
        'https://www.peninsulacleanenergy.com/ev-charging-incentives/',
        'https://www.irs.gov/businesses/plug-in-electric-vehicle-credit-irc-30-and-irc-30d',
        'https://afdc.energy.gov/laws/12309',
        'https://cleanvehiclerebate.org/eng/fleet',
        'https://calevip.org/incentive-project/san-joaquin-valley']
import requests
from bs4 import BeautifulSoup
import sys
from websites import urls

def scrape():
    for x in range(len(urls)):
        f = open("test"+str(x)+".txt", 'w')
        for url in urls:
            page = requests.get(url)
            #this line of code creates a Beautiful Soup object that takes page.content as input
            soup = BeautifulSoup(page.content, "html.parser")
            results = (soup.prettify().encode('cp1252', errors='ignore'))
            #we need a command that enters the results into the file we just created.
            f.write(str(results))
So far, I am able to get the function to perform steps 1 & 2. The problem is that the text scraped from the first website is being written to all 8 of the .txt files. What I want is for the text from the first website to go into the first .txt file, the text from the second website into the second file, the text from the third website into the third file, and so on.
How do I fix this? I feel like I am close, but my second for loop isn't written correctly.
Upvotes: 0
Views: 205
Reputation:
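The nested loops fetch every URL again for every file, so no file ends up paired with a single website. One loop over the URLs, with a counter for the file name, is enough. Try doing it this way: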
import requests
from bs4 import BeautifulSoup as BS

urls = ['https://calevip.org/incentive-project/northern-california',
        'https://www.slocleanair.org/community/grants/altfuel.php',
        'https://www.mcecleanenergy.org/ev-charging/',
        'https://www.peninsulacleanenergy.com/ev-charging-incentives/',
        'https://www.irs.gov/businesses/plug-in-electric-vehicle-credit-irc-30-and-irc-30d',
        'https://afdc.energy.gov/laws/12309',
        'https://cleanvehiclerebate.org/eng/fleet',
        'https://calevip.org/incentive-project/san-joaquin-valley']

# Some sites reject requests that do not look like they come from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def scrape():
    # A single session reuses connections across requests
    with requests.Session() as session:
        i = 1
        for url in urls:
            try:
                page = session.get(url, headers=headers)
                # Raise an exception on 4xx/5xx responses instead of saving an error page
                page.raise_for_status()
                # One file per URL; the 'lxml' parser needs the lxml package installed
                with open(f'test{i}.txt', 'w') as f:
                    f.write(BS(page.text, 'lxml').prettify())
                i += 1
            except Exception as e:
                print(f'Exception while processing {url} -> {e}')

if __name__ == '__main__':
    scrape()
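As a variation on the code above, here is a minimal sketch that isolates the "one URL, one file" pairing using enumerate(). The names save_pages, fetch and out_dir are made up for this example and are not part of the original code. The fetch argument is assumed to be any callable that takes a URL and returns the page text, so the pairing logic can be tried without network access. It also writes with an explicit UTF-8 encoding, which may avoid the cp1252 workaround in the question. Unlike the code above, a URL that fails leaves its file number unused rather than reusing it.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def save_pages(urls: Iterable[str],
               fetch: Callable[[str], str],
               out_dir: str = ".") -> List[Path]:
    """Write the text fetched for each URL to its own numbered file.

    `fetch` is a hypothetical hook: any callable mapping a URL to text.
    Returns the list of files that were written.
    """
    written = []
    # enumerate() pairs each URL with its own index, so a single loop is enough
    for i, url in enumerate(urls, start=1):
        try:
            text = fetch(url)
        except Exception as e:
            # Skip this URL and keep going; its file number is left unused
            print(f"Exception while processing {url} -> {e}")
            continue
        path = Path(out_dir) / f"test{i}.txt"
        # Explicit UTF-8 avoids encode errors where the platform default is cp1252
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

With the session and headers from the code above, the call might look like save_pages(urls, lambda u: BS(session.get(u, headers=headers).text, 'lxml').prettify()), though that is untested against these particular sites.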
Upvotes: 2