Pete Marise

Reputation: 99

Trying to loop through a list of URLs and scrape each page for text

My code loops through the list of URLs, but it isn't adding the text content of each scraped page to the presults list.

I haven't gotten to the raw text processing yet. I'll probably post a separate question for that once I get there if I can't figure it out.

What is wrong here? The length of presults remains at 1 even though it seems to be looping through the list of urls for the scrape...

Here's part of the code I'm having an issue with:

counter=0
for xa in range(0,len(qresults)):
        pageURL=qresults[xa].format()
        pageresp= requests.get(pageURL, headers=headers)
        if pageresp.status_code==200:
                print(pageURL)
                psoup=BeautifulSoup(pageresp.content, 'html.parser')
                presults=[]
                para=psoup.text
                presults.append(para)
                print(len(presults))
        else: print("Could not reach domain")
print(len(presults))

Upvotes: 0

Views: 1241

Answers (2)

Prune

Reputation: 77910

Your immediate problem is here:

            presults=[]
            para=psoup.text
            presults.append(para)

On every for iteration, you replace your existing presults list with the empty list and add one item. On the next iteration, you again wipe out the previous result.

Your initialization must be done only once, before the loop:

presults = []
for xa in range(0,len(qresults)):
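
For reference, here is a minimal sketch of the whole loop with the list created once, up front. The names collect_page_texts, fetch_text and fake_pages are hypothetical and not from your code; fetch_text stands in for the requests.get plus BeautifulSoup step so that the example runs without network access.

```python
def collect_page_texts(urls, fetch_text):
    """Collect the text of every page that fetch_text can retrieve.

    fetch_text is a hypothetical callable: it takes a URL and returns the
    page text, or None when the page could not be reached.
    """
    presults = []  # initialized once, before the loop
    for page_url in urls:
        para = fetch_text(page_url)
        if para is None:
            print("Could not reach", page_url)
            continue
        presults.append(para)  # accumulates across iterations
    return presults


# Stand-in for the network: a dict lookup instead of an HTTP request.
fake_pages = {"http://a.example": "text A", "http://b.example": "text B"}
texts = collect_page_texts(
    ["http://a.example", "http://missing.example", "http://b.example"],
    fake_pages.get,
)
print(len(texts))
```

In your script, fetch_text would presumably wrap requests.get(pageURL, headers=headers) and return BeautifulSoup(pageresp.content, 'html.parser').text when status_code is 200, and None otherwise.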

Upvotes: 1

ASH

Reputation: 20362

Ok, I don't even see you looping through any URLs here, but below is a generic example of how this kind of request can be achieved.

import requests
from bs4 import BeautifulSoup

# The page number is appended to this URL, so it must end with "page=".
base_url = "http://www.privredni-imenik.com/pretraga?abcd=&keyword=&cities_id=0&category_id=0&sub_category_id=0&page="
current_page = 1

while current_page < 200:
    print(current_page)
    url = base_url + str(current_page)
    r = requests.get(url)
    zute_soup = BeautifulSoup(r.text, 'html.parser')
    # Each listing on the page is a div with class "jobs-item".
    firme = zute_soup.find_all('div', {'class': 'jobs-item'})

    for title in firme:
        title1 = title.find_all('h6')[0].text
        descriptions = title.find_all('div', {'class': 'description'})
        adresa = descriptions[0].text   # address
        kontakt = descriptions[1].text  # contact details
        # Combine the three fields into one block of text and print it.
        page_line = "{title1}\n{adresa}\n{kontakt}".format(
            title1=title1,
            adresa=adresa,
            kontakt=kontakt
        )
        print(page_line)
        print('\n')
    current_page += 1
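
As an aside, the page URL can also be built from a dict of query parameters instead of by string concatenation, which makes it harder to get the query string wrong. This is only a sketch: SEARCH_URL and page_url are names I made up, and the parameter names are simply copied from the URL above.

```python
from urllib.parse import urlencode

# Hypothetical constant: the search endpoint without its query string.
SEARCH_URL = "http://www.privredni-imenik.com/pretraga"


def page_url(page):
    """Return the search URL for the given page number."""
    params = {
        "abcd": "",
        "keyword": "",
        "cities_id": 0,
        "category_id": 0,
        "sub_category_id": 0,
        "page": page,
    }
    return SEARCH_URL + "?" + urlencode(params)


print(page_url(3))
```

requests.get also accepts a params argument that does the same encoding, so passing the dict there would work as well.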

Upvotes: 0
