I am trying to crawl several links, extract the text found in the <p> HTML tags, and write the output to different files. Each link should have its own output file. So far:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import csv
import pyperclip
import pprint
import requests
urls = ['https://link1',
'https://link2']
url_list = list(urls)
#scrape elements
for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find_all('p')
    page = soup.getText()

for line in urls:
    with open('filename{}.txt'.format(line), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
I am getting OSError: [Errno 22] Invalid argument: filenamehttps://link1
If I change my code to this:
for index, line in enumerate(urls):
    with open('filename{}.txt'.format(index), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
The script runs, but I have a semantic error: both output files contain the text extracted from link2. I guess the second for loop causes this.
I've searched S/O for similar answers but I can't figure it out.
Upvotes: 0
Views: 707
Reputation: 20042
I'm guessing you're on some sort of *nix system, as the error has to do with / being interpreted as part of the path. So you either have to sanitize your file names or create the path you want to save the output to.
Having said that, using the URL as a file name is not a great idea, precisely because of the above error. You could either replace the / with, say, _, or just name your files differently.
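If you still want the file name to reflect the URL, here's a minimal sketch of one way to build a file-system-safe name from it (the url_to_filename helper is my own illustration, not part of your code), assuming the host and path are enough to tell the files apart:

import re
from urllib.parse import urlparse

def url_to_filename(url):
    # Hypothetical helper: keep the host and path,
    # e.g. "https://de.lipsum.com/" -> "de.lipsum.com/"
    parsed = urlparse(url)
    stem = parsed.netloc + parsed.path
    # Replace anything that isn't a letter, digit, dot, underscore or hyphen
    return re.sub(r'[^A-Za-z0-9._-]+', '_', stem).strip('_') + '.txt'

print(url_to_filename('https://de.lipsum.com/'))  # de.lipsum.com.txt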
Also, this:

urls = ['https://link1',
        'https://link2']

is already a list, so there's no need for this:

url_list = list(urls)
And there's no need for two for loops. You can write to a file as you scrape the URLs from the list.
Here's the working code with some dummy websites:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
urls = ['https://lipsum.com/', 'https://de.lipsum.com/']
for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find("div", {"id": "Panes"}).find("p").getText()

    with open('filename_{}.txt'.format(url.replace("/", "_")), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
You could also use your approach with enumerate():
import requests
from bs4 import BeautifulSoup
urls = ['https://lipsum.com/', 'https://de.lipsum.com/']
for index, url in enumerate(urls, start=1):
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find("div", {"id": "Panes"}).find("p").getText()

    with open('filename_{}.txt'.format(index), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
Upvotes: 1