Reputation: 3
My loop doesn't seem to go past the first page. What's wrong? Also, if the word you're looking for is in the link, it won't report the right number of occurrences: it displays 5 outputs, each with 5 as the count.
import requests
from bs4 import BeautifulSoup

for i in range(1, 5):
    url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i)
    the_word = 'is'
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    count = len(words)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))
Upvotes: 0
Views: 83
Reputation: 84465
As an aside, the search word has its own class name, so you can just count those elements. The code below correctly returns 0 when the word is not found on the page. You could use this approach within your loop.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.nairaland.com/search?q=afonja&board=0&topicsonly=2')
soup = bs(r.content, 'lxml')
occurrences = len(soup.select('.highlight'))
print(occurrences)
import requests
from bs4 import BeautifulSoup as bs
for i in range(9):
    r = requests.get('https://www.nairaland.com/search/afonja/0/0/0/{}'.format(i))
    soup = bs(r.content, 'lxml')
    occurrences = len(soup.select('.highlight'))
    print(occurrences)
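To see what the `.highlight` count measures without hitting the site, here is a small offline sketch. It swaps BeautifulSoup for the standard library's `html.parser`, so it runs with no extra packages. The sample HTML and the `HighlightCounter` / `count_highlights` names are made up for illustration, and the real page markup may differ.

```python
from html.parser import HTMLParser


class HighlightCounter(HTMLParser):
    """Counts start tags whose class attribute includes 'highlight' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if "highlight" in classes:
            self.count += 1


def count_highlights(html):
    """Return how many elements in `html` carry the 'highlight' class."""
    parser = HighlightCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up sample markup, not copied from the site
sample = (
    '<p>The <span class="highlight">afonja</span> thread</p>'
    '<p>Another <span class="highlight">afonja</span> post, and a '
    '<span class="other">plain</span> span</p>'
)
print(count_highlights(sample))            # 2
print(count_highlights("<p>nothing</p>"))  # 0
```

A page with no highlighted matches simply yields 0, which is the same behaviour `len(soup.select('.highlight'))` gives on an empty result list.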
Upvotes: 0
Reputation: 8913
Try:
import requests
from bs4 import BeautifulSoup

for i in range(6):
    url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i)
    the_word = 'afonja'
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    # find() returns the first matching string, or None if nothing matches
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    count = 0
    if words:
        # only call len() when there is a match; this is the length of that string
        count = len(words)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))
EDIT after new specifications.
Assuming the word to count is the same as the one in the URL, note that the word is highlighted on the page and can be recognized by span class=highlight in the HTML.
So you can use this code:
import requests
from bs4 import BeautifulSoup

for i in range(6):
    url = 'https://www.nairaland.com/search/afonja/0/0/0/{}'.format(i)
    the_word = 'afonja'
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    count = len(soup.find_all('span', {'class':'highlight'}))
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))
and you get:
Url: https://www.nairaland.com/search/afonja/0/0/0/0
contains 30 occurrences of word: afonja
Url: https://www.nairaland.com/search/afonja/0/0/0/1
contains 31 occurrences of word: afonja
Url: https://www.nairaland.com/search/afonja/0/0/0/2
contains 36 occurrences of word: afonja
Url: https://www.nairaland.com/search/afonja/0/0/0/3
contains 30 occurrences of word: afonja
Url: https://www.nairaland.com/search/afonja/0/0/0/4
contains 45 occurrences of word: afonja
Url: https://www.nairaland.com/search/afonja/0/0/0/5
contains 50 occurrences of word: afonja
Upvotes: 0
Reputation: 304
This works fine for me:
import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    # range(0, 6) covers page indexes 0 to 5 (six pages; the first page is index 0)
    # use range(0, 5) for indexes 0 to 4 (five pages in total)
    for i in range(0, 6):
        url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i)
        print(url, "url")
        the_word = 'is'
        r = requests.get(url, allow_redirects=False)
        soup = BeautifulSoup(r.content, 'lxml')
        words = soup.find(text=lambda text: text and the_word in text)
        print(words)
        count = len(words)
        print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))
This is the output:
https://www.nairaland.com/search/ipob/0/0/0/0 url
is somewhere in Europe sending semi nude video on the internet.Are you proud of such groups with such leader?
Url: https://www.nairaland.com/search/ipob/0/0/0/0
contains 110 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/1 url
Notre is a French word; means 'Our"...and Dame means "Lady" So Notre Dame means Our Lady.
Url: https://www.nairaland.com/search/ipob/0/0/0/1
contains 89 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/2 url
How does all this uselessness Help Foolish
Url: https://www.nairaland.com/search/ipob/0/0/0/2
contains 43 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/3 url
Dumb fuckers everywhere. I thought I was finally going to meet someone that has juju and can show me. Instead I got a hopeless broke buffoon that loves boasting online. Nairaland I apologize on the behalf of this waste of space and time. He is not even worth half of the data I have spent writing this post.
Url: https://www.nairaland.com/search/ipob/0/0/0/3
contains 308 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/4 url
People like FFK, Reno, Fayose etc have not been touched, it is an unknown prophet that hasn't said anything against the FG that you expect the FG to waste its time on.
Url: https://www.nairaland.com/search/ipob/0/0/0/4
contains 168 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/5 url
children send them to prison
Url: https://www.nairaland.com/search/ipob/0/0/0/5
contains 29 occurrences of word: is
Process finished with exit code 0
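Note that `soup.find(...)` returns only the first matching string, so `len(words)` in the code above is the number of characters in that string rather than a word count. A minimal sketch of counting whole-word matches follows. `count_word` is a hypothetical helper, and it assumes you pass it plain text, for example the result of `soup.get_text()`.

```python
import re


def count_word(text, word):
    """Count case-insensitive whole-word matches of `word` in `text`."""
    pattern = r"\b{}\b".format(re.escape(word))
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Made-up sample text for illustration
sample = "This is it. Is this his list? It is."
print(count_word(sample, "is"))  # 3: whole words only, ignoring case
print(sample.count("is"))        # 6: plain substring count also hits "This", "his", "list"
```

Whether substrings such as the "is" inside "this" should count is a design choice; drop the `\b` anchors if they should.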
Upvotes: 0
Reputation: 1493
If you want to cover the first 6 pages, starting with the first one, change the range in your loop:
for i in range(6): # the first page is at index `0`
or:
for i in range(0, 6):
instead of:
for i in range(1, 5): # this starts from the second page, since the second page is at index `1`
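As a quick check of which pages each range covers, this sketch builds the URLs without requesting them. `page_urls` is a hypothetical helper; the URL pattern is the one from the question.

```python
def page_urls(word, pages):
    """Build search URLs for the given page indexes (the first page is index 0)."""
    base = 'https://www.nairaland.com/search/{}/0/0/0/{}'
    return [base.format(word, i) for i in pages]


# range(6) covers indexes 0..5, i.e. the first six pages
for url in page_urls('ipob', range(6)):
    print(url)

# range(1, 5) covers indexes 1..4 only, so it skips the first page
print(len(page_urls('ipob', range(1, 5))))  # 4
```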
Upvotes: 1