introvertme

Reputation: 3

My web scraping code with BeautifulSoup doesn't go past the first page

My code doesn't seem to go past the first page. What's wrong? Also, if the word I'm looking for appears in the link, it doesn't report the right number of occurrences: it displays 5 outputs, each with 5 as the count.

import requests
from bs4 import BeautifulSoup

for i in range (1,5):

    url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i)
    the_word = 'is' 
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text) 
    print(words) 
    count =  len(words)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))

Upvotes: 0

Views: 83

Answers (4)

QHarr

Reputation: 84465

As an aside, the search word has its own class name on the results page, so you can just count the elements with that class. The code below correctly returns 0 when the word is not found on the page. You could use this approach within your loop.

import requests 
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nairaland.com/search?q=afonja&board=0&topicsonly=2')
soup = bs(r.content, 'lxml')
occurrences = len(soup.select('.highlight'))
print(occurrences)

Inside a loop over the result pages:

import requests
from bs4 import BeautifulSoup as bs

for i in range(9):
    r = requests.get('https://www.nairaland.com/search/afonja/0/0/0/{}'.format(i))
    soup = bs(r.content, 'lxml')
    occurrences = len(soup.select('.highlight'))
    print(occurrences)

Upvotes: 0

sentence

Reputation: 8913

Try:

import requests
from bs4 import BeautifulSoup 

for i in range(6):
    url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i)
    the_word = 'afonja' 
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
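    # find() returns only the first matching string, or None if nothing matches
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    count = 0
    if words:
        # guard against None; note that len() of a string is its character count
        count = len(words)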
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))

EDIT after new specifications.

Assuming the word to count is the same as the search term in the URL, note that the site highlights that word in the results page: each occurrence is wrapped in a span with class="highlight" in the HTML.

So you can count those spans with this code:

import requests
from bs4 import BeautifulSoup 

for i in range(6):
    url = 'https://www.nairaland.com/search/afonja/0/0/0/{}'.format(i)
    the_word = 'afonja' 
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    count = len(soup.find_all('span', {'class':'highlight'})) 
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))

and you get:

Url: https://www.nairaland.com/search/afonja/0/0/0/0
contains 30 occurrences of word: afonja

Url: https://www.nairaland.com/search/afonja/0/0/0/1
contains 31 occurrences of word: afonja

Url: https://www.nairaland.com/search/afonja/0/0/0/2
contains 36 occurrences of word: afonja

Url: https://www.nairaland.com/search/afonja/0/0/0/3
contains 30 occurrences of word: afonja

Url: https://www.nairaland.com/search/afonja/0/0/0/4
contains 45 occurrences of word: afonja

Url: https://www.nairaland.com/search/afonja/0/0/0/5
contains 50 occurrences of word: afonja
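
The highlight count only works for the word that was searched for. As a rough sketch of the difference, the snippet below runs both approaches on a small inline HTML string, so it needs no network access. SAMPLE_HTML and the two helper functions are made up for illustration, the real page markup may differ, and it uses the built-in html.parser instead of lxml to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one search-results page (not the real markup).
SAMPLE_HTML = """
<div>
  <p>The <span class="highlight">afonja</span> thread is long.</p>
  <p>Another <span class="highlight">Afonja</span> post. Nothing else here.</p>
  <p>No match in this paragraph.</p>
</div>
"""

def count_highlights(html):
    """Count the spans the site uses to mark the searched word."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("span", {"class": "highlight"}))

def count_in_text_nodes(html, word):
    """Count case-insensitive substring occurrences of `word` in all text nodes."""
    soup = BeautifulSoup(html, "html.parser")
    needle = word.lower()
    # find_all() returns every matching string, unlike find(), which returns one
    nodes = soup.find_all(string=lambda s: s and needle in s.lower())
    return sum(node.lower().count(needle) for node in nodes)

print(count_highlights(SAMPLE_HTML))
print(count_in_text_nodes(SAMPLE_HTML, "afonja"))
```

The second helper does plain substring matching, so a short word such as 'is' would also be counted inside longer words like 'this'.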

Upvotes: 0

Michele Rava

Reputation: 304

This works fine for me:

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":

    # range(0, 6) requests pages 0 through 5 (six pages); the first page has index 0
    # use range(0, 5) to request only the first five pages (0 through 4)
    for i in range(0, 6):

        url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i)
        print(url, "url")
        the_word = 'is'
        r = requests.get(url, allow_redirects=False)
        soup = BeautifulSoup(r.content, 'lxml')
        words = soup.find(text=lambda text: text and the_word in text)
        print(words)
        count =  len(words)
        print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))

This is the output:

https://www.nairaland.com/search/ipob/0/0/0/0 url
 is somewhere in Europe sending semi nude video on the internet.Are you proud of such groups with such leader?

Url: https://www.nairaland.com/search/ipob/0/0/0/0
contains 110 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/1 url
Notre is a French word; means 'Our"...and Dame means "Lady" So Notre Dame means Our Lady.

Url: https://www.nairaland.com/search/ipob/0/0/0/1
contains 89 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/2 url
How does all this uselessness Help Foolish 

Url: https://www.nairaland.com/search/ipob/0/0/0/2
contains 43 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/3 url
Dumb fuckers everywhere. I thought I was finally going to meet someone that has juju and can show me. Instead I got a hopeless broke buffoon that loves boasting online. Nairaland I apologize on the behalf of this waste of space and time. He is not even worth half of the data I have spent writing this post. 

Url: https://www.nairaland.com/search/ipob/0/0/0/3
contains 308 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/4 url
People like FFK, Reno, Fayose etc have not been touched, it is an unknown prophet that hasn't said anything against the FG that you expect the FG to waste its time on. 

Url: https://www.nairaland.com/search/ipob/0/0/0/4
contains 168 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/5 url
 children send them to prison

Url: https://www.nairaland.com/search/ipob/0/0/0/5
contains 29 occurrences of word: is

Process finished with exit code 0
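
One thing worth checking in this output: soup.find() returns only the first matching string, so len(words) is the character length of that string and not a count of the word. A minimal sketch using the page 0 string from the output above; count_word is a hypothetical helper, and matching whole words case-insensitively is an assumption about what should be counted.

```python
import re

# First matching string printed for page 0 above (leading space included).
first_match = (" is somewhere in Europe sending semi nude video on the "
               "internet.Are you proud of such groups with such leader?")

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = r"\b{}\b".format(re.escape(word))
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

char_length = len(first_match)             # what len(words) reports
word_hits = count_word(first_match, "is")  # actual whole-word matches

print(char_length, word_hits)
```

The length comes out as 110, the same number reported as "occurrences" for page 0, while the word 'is' appears once in that string.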

Upvotes: 0

YusufUMS

Reputation: 1493

If you want to cover the first 6 pages, starting from the very first one, change the range in your loop:

for i in range (6):   # the first page is addressed at index `0`

or:

for i in range (0,6):

instead of:

for i in range (1,5):    # this starts from the second page, since the second page is indexed at `1`, and stops at index `4`
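
To see which page indices each range produces, here is a quick check that builds the URLs without sending any requests. The variable names are only for illustration.

```python
# range(1, 5) skips index 0 (the first page) and stops before 5
start_at_one = list(range(1, 5))

# range(6) and range(0, 6) are equivalent and include index 0
from_zero = list(range(6))

base = 'https://www.nairaland.com/search/ipob/0/0/0/{}'
urls = [base.format(i) for i in from_zero]

print(start_at_one)
print(from_zero)
print(urls[0])
```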

Upvotes: 1
