anon

Reputation:

Write multiple files inside for-loop

I am trying to crawl several links, extract the text found in the <p> HTML tags, and write the output to different files. Each link should have its own output file. So far:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import csv
import pyperclip
import pprint
import requests

urls = ['https://link1',
        'https://link2']
url_list = list(urls)

#scrape elements
for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find_all('p')
    page = soup.getText()
for line in urls:
    with open('filename{}.txt'.format(line), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))

I am getting OSError: [Errno 22] Invalid argument: filenamehttps://link1

If I change my code into this

for index, line in enumerate(urls):
    with open('filename{}.txt'.format(index), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))

The script runs, but I have a semantic error: both output files contain the text extracted from link2. I guess the second for-loop does this, since it only runs after the first loop has finished, when page holds only the text from the last URL.

I've researched S/O for similar answers but I can't figure it out.

Upvotes: 0

Views: 707

Answers (1)

baduker

Reputation: 20042

I'm guessing you're on some sort of *nix system, as the error has to do with the / being interpreted as part of the path.

So you have to either name your files so they make valid paths, or create the directories you want to save the output into.

Having said that, using the URL as a file name is not a great idea, because of the above error.

You could either replace the / with, say, _, or just name your files differently (a small sketch of one way to do that follows).
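For instance, here's a minimal sketch of deriving a safer file name from a URL; the url_to_filename helper is my own illustration (not part of the original code), built on the standard-library urlparse and re.sub:

from urllib.parse import urlparse
import re

def url_to_filename(url):
    # hypothetical helper: keep the host and path, then replace anything
    # that isn't a letter, digit, dot or dash with an underscore
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    return re.sub(r'[^A-Za-z0-9.-]+', '_', raw).strip('_') or 'page'

# e.g. url_to_filename('https://de.lipsum.com/') gives 'de.lipsum.com'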

Also, this:

urls = ['https://link1',
        'https://link2']

is already a list, so there's no need for this:

url_list = list(urls)

And there's no need for two for-loops; you can write to a file as you scrape the URLs from the list.

Here's the working code with a dummy website:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

urls = ['https://lipsum.com/', 'https://de.lipsum.com/']

for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    # grab the first <p> inside the div with id "Panes" on the lipsum pages
    page = soup.find("div", {"id": "Panes"}).find("p").getText()
    # swap the slashes for underscores so the URL can be part of the file name
    with open('filename_{}.txt'.format(url.replace("/", "_")), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))

You could also use your approach with enumerate():

import requests
from bs4 import BeautifulSoup

urls = ['https://lipsum.com/', 'https://de.lipsum.com/']

# start=1 so the first file is filename_1.txt rather than filename_0.txt
for index, url in enumerate(urls, start=1):
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find("div", {"id": "Panes"}).find("p").getText()
    # name the files by position in the list instead of by URL
    with open('filename_{}.txt'.format(index), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))

Upvotes: 1
