Horai Nuri

Reputation: 5578

Python BeautifulSoup Spider is not working

Hi, I'm trying to learn how to scrape elements with Python. I was trying to get the title of a web page (local.ch), but my code is not working and I don't know why.

Here is the Python code:

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page < max_pages:
        url = 'http://yellow.local.ch/fr/q/Morges/Bar.html?page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class':'details-entry-title-link'}):
            title = link.string
            print(title)
        page += 1

spider(1)

I'm pretty sure the code is correct, and I don't get any errors in PyCharm. Why is it not working?

Upvotes: 0

Views: 1462

Answers (3)

Renae Lider

Reputation: 1024

You have a major bug in your code:

page = 1
while page < max_pages:
....
spider(1)

The condition 1 < 1 is never true, so the loop body never executes. Other problems are an encoding error and a warning about the unspecified parser:

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    # <= so that spider(1) still processes page 1
    while page <= max_pages:
        url = 'http://yellow.local.ch/fr/q/Morges/Bar.html?page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text.encode("utf-8")
        # Naming the parser explicitly avoids the "no parser specified" warning
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.find_all('a', {'class': 'details-entry-title-link'}):
            title = link.string
            # Prints bytes, so the output carries a b prefix
            print(title.encode("utf-8"))
        page += 1

spider(1)

Note the encode("utf-8") calls: encoding produces bytes, which is why the printed output carries a b prefix. Without this step, print() can throw a UnicodeEncodeError on a console whose encoding cannot represent the characters. The same change is made on the plain_text = source_code.text.encode("utf-8") line.
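
To illustrate the b prefix, here is a small sketch using a made-up title (the string is an assumption, not taken from the real page). It only shows what encode() does to a str in Python 3, without any network access:

```python
# Hypothetical title with a non-ASCII character ("\u00e9" is e with an acute accent).
title = "Caf\u00e9 du Port"

encoded = title.encode("utf-8")   # str -> bytes
print(type(encoded).__name__)     # the result is a bytes object
print(encoded)                    # printed with a b prefix and escaped bytes

# Decoding with the same codec gives back the original str.
decoded = encoded.decode("utf-8")
print(decoded == title)
```

If the console can display the characters, printing the str directly is usually nicer to read than printing the bytes.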

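Also make sure the page += 1 line is indented so that it sits inside the while loop (but outside the for loop). If it ends up outside the while loop, page never changes and the loop cannot terminate.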
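
One way to sidestep the manual counter and its indentation altogether is to let range() drive the loop. A minimal sketch, assuming pages are numbered from 1; the helper name page_urls and the constant BASE_URL are hypothetical, and no request is made here:

```python
BASE_URL = 'http://yellow.local.ch/fr/q/Morges/Bar.html?page='

def page_urls(max_pages):
    """Return the URLs for pages 1..max_pages (inclusive)."""
    # range() stops before its second argument, hence the + 1.
    return [BASE_URL + str(page) for page in range(1, max_pages + 1)]

for url in page_urls(2):
    print(url)
```

Each URL could then be passed to requests.get() in place of the hand-built one.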

Upvotes: 2

Ben Beirut

Reputation: 763

You are passing 1 to spider as the max_pages argument, but the while loop only executes while page < max_pages, and 1 < 1 is false.
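
The effect of the condition can be checked without touching the network. A small sketch (the function count_iterations is made up for illustration) that counts how often the loop body runs with < versus <=:

```python
def count_iterations(max_pages, inclusive):
    """Count how many times the loop body would run, starting from page 1."""
    page = 1
    runs = 0
    while (page <= max_pages) if inclusive else (page < max_pages):
        runs += 1
        page += 1
    return runs

print(count_iterations(1, inclusive=False))  # 0: 1 < 1 is False, body is skipped
print(count_iterations(1, inclusive=True))   # 1: 1 <= 1 is True, body runs once
```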

Upvotes: 1

Vivek Anand

Reputation: 651

Probably because you intended to start the page variable at 0, not 1. As written, the loop is never entered, because page and max_pages both have the same value, 1.

Upvotes: 1
