Lukemul69

Reputation: 187

How to use multiple URLs from a txt file using BeautifulSoup

I am new to this. My code runs successfully with a single URL in the .txt file, but as soon as I add more it throws an error. I have tried several approaches I found on this site, but none of them worked. If anyone can help, that would be great.

My main objective is for it to process the first URL and, once that has completed, move on to the second, looping through all of them in turn.

Here is what I have right now...

import requests
import lxml.html
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed

ua = UserAgent()
header = {'user-agent':ua.random}

with open('urls.txt','r') as file:
    for url in file.readlines():
        result = requests.get(url,headers=header,timeout=3)
        src = result.content
        soup = BeautifulSoup(src, 'lxml')

Upvotes: 1

Views: 123

Answers (2)

Greg

Reputation: 4518

There is far too much going on in the code, and I'm not sure what the actual issue is. Can you read urls.txt? If so, what does it contain?

As a starting point, try separating your code into functions.

For example:

import requests
import lxml.html
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed

def getReadMe():
    with open('urls.txt','r') as file:
        return file.read()

def getHtml(readMe):
    ua = UserAgent()
    header = {'user-agent':ua.random}
    response = requests.get(readMe,headers=header,timeout=3)
    response.raise_for_status() # throw error for 4xx & 5xx
    return response.content

readMe = getReadMe()
print(readMe) #TODO: does this output text? If so what is it?
html = getHtml(readMe)
soup = BeautifulSoup(html, 'lxml')  # parse the HTML returned by getHtml
# TODO: what is in the response html?
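
Once the fetch step lives in its own function, each URL can be tried independently, which makes it easier to see which one fails. The following is only a minimal sketch of that idea: `fetch_all` and its `fetch` parameter are hypothetical names, not part of the code above, and `fetch` stands for any function that takes one URL and returns its content or raises.

```python
def fetch_all(urls, fetch):
    """Call fetch(url) for each URL in order, keeping results and errors apart."""
    results = {}
    errors = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # deliberately broad: this is a diagnostic sketch
            errors[url] = exc     # remember the failure and carry on with the next URL
    return results, errors
```

Passing a function such as getHtml as `fetch` should then show, per URL, whether the request succeeded or which exception it raised.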

Upvotes: 1

chrisharrel

Reputation: 336

You need to loop over them. This code assumes there is one URL per line in your file:

import requests
import lxml.html
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed

ua = UserAgent()
header = {'user-agent':ua.random}

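with open('urls.txt','r') as file:
    for line in file:
        url = line.strip()  # each line read from the file keeps its trailing newline
        if not url:
            continue  # skip blank lines
        result = requests.get(url,headers=header,timeout=3)
        src = result.content
        soup = BeautifulSoup(src, 'lxml')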
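
Because the loop relies on there being one URL per line, it may help to do the file reading in a small helper that can be checked on its own. This is a rough sketch under that same assumption; `read_urls` is a hypothetical name, and it uses only the standard library so it can be tested without making any requests.

```python
from pathlib import Path

def read_urls(path):
    """Return the non-empty lines of a text file, stripped of surrounding whitespace."""
    urls = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()   # remove stray spaces and line endings
        if line:              # ignore blank lines
            urls.append(line)
    return urls
```

Printing `read_urls('urls.txt')` before making any requests is a quick way to confirm that every entry is a clean URL with no trailing newline.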

Upvotes: 1
