Olivia Lundy

Reputation: 127

BeautifulSoup: looping through variable URLs

I'm trying to store some data scraped from a website. There are more than 100 URLs and they are very similar to each other, so I tried to build them with a %s placeholder in my code.

Example URLs:

https://www.yahoo.com/lifestyle/tagged/food,
https://www.yahoo.com/lifestyle/tagged/sports,
https://www.yahoo.com/lifestyle/tagged/usa,
https://www.yahoo.com/lifestyle/tagged/health, and so on.

My Django+Bs4 Loop:

from django.core.management.base import BaseCommand
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from scraping.models import Job
import requests as req


header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'}

class Command(BaseCommand):
    def handle(self,  *args, **options):
        TAGS = ['economy', 'food', 'sports', 'usa', 'health']
        resp = req.get('https://www.yahoo.com/lifestyle/tagged/%s' % (TAGS),headers=header)
        soup = BeautifulSoup(resp.text, 'lxml')

        for i in range(len(soup)):
            titles = soup.findAll("div", {"class": "StretchedBox Z(1)"})
            
        print (titles)

The error message is:

TypeError: not all arguments converted during string formatting

I have been playing around with loops, but I am very new to this and can't work out how to loop over the URLs. What am I missing here? Can someone more knowledgeable point me in the right direction? Many thanks.

Upvotes: 0

Views: 294

Answers (2)

Mitchell Olislagers

Reputation: 1817

You can loop through your tags to send a request for each tag.

import requests

header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'}

TAGS = ['economy', 'food', 'sports', 'usa', 'health']
for tag in TAGS:
    # Build one URL per tag and send a separate request for each
    resp = requests.get(f"https://www.yahoo.com/lifestyle/tagged/{tag}", headers=header)
    # Print the size of each response body
    print(len(resp.text))

#341723
#442712
#447413
#368508
#445326
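
As background on the error itself, here is a small sketch of how the `%` operator treats what sits on its right-hand side. It only uses the tag list and base URL from the question; the names `BASE`, `whole_list_url`, `tuple_error` and `urls` are made up for illustration. Note that with TAGS as a list, exactly as posted, `%` should not raise at all: the whole list is formatted as one value. The TypeError from the question shows up when the right-hand side is a tuple with more items than there are `%s` placeholders.

```python
TAGS = ['economy', 'food', 'sports', 'usa', 'health']
BASE = 'https://www.yahoo.com/lifestyle/tagged/%s'  # illustrative name

# A list on the right-hand side is treated as ONE value, so the repr of
# the whole list ends up in the URL: no error, but not a usable URL.
whole_list_url = BASE % TAGS

# A tuple is unpacked into separate arguments; five values for a single
# %s placeholder raise a TypeError.
try:
    BASE % tuple(TAGS)
    tuple_error = None
except TypeError as exc:
    tuple_error = str(exc)

# Formatting one tag at a time yields one URL per tag.
urls = [BASE % tag for tag in TAGS]
print(urls[1])
```

Either way, the fix is the same as in the loop above: format a single tag per request instead of passing the whole collection.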

Upvotes: 1

CryptoFool

Reputation: 23079

It appears that you want to insert each of the values in TAGS individually and perform a request for each of them. So you need to loop over TAGS and submit a request for each one. I expect that you want something like this:

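# req, header and BeautifulSoup are the names already defined in the question's code
TAGS = ['economy', 'food', 'sports', 'usa', 'health']
for tag in TAGS:
    resp = req.get(f'https://www.yahoo.com/lifestyle/tagged/{tag}', headers=header)
    soup = BeautifulSoup(resp.text, 'lxml')
    # process the page here, e.g. call soup.find_all(...) for this tag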
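
For the processing step, a minimal sketch might look like the following. It is an assumption about what you want to extract, based on the `StretchedBox` class in your question; `extract_titles` and `sample_html` are hypothetical names, and the sample markup only stands in for `resp.text` (the real page structure may differ). It uses the built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the text of every <div> that carries the StretchedBox class."""
    soup = BeautifulSoup(html, 'html.parser')
    # A single find_all call already returns every match, so no index
    # loop over the soup is needed. class_='StretchedBox' matches any
    # div that has StretchedBox among its classes.
    divs = soup.find_all('div', class_='StretchedBox')
    return [div.get_text(strip=True) for div in divs]


# Hypothetical sample markup standing in for resp.text
sample_html = '''
<div class="StretchedBox Z(1)">First headline</div>
<div class="Other">Not a title</div>
<div class="StretchedBox Z(1)">Second headline</div>
'''

titles = extract_titles(sample_html)
print(titles)
```

Inside the loop you would call `extract_titles(resp.text)` once per tag and store or print the result before moving on to the next tag.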

Upvotes: 1
