alienware13user

Reputation: 128

Beautiful Soup <p> parameter

I'm trying to print the title of each item on DoneDeal. I copied the code from my own spider, which works flawlessly on Overclockers, and changed it accordingly:

import requests
from bs4 import BeautifulSoup

def donedeal(max_pages):
    for i in range(1, max_pages+1):
        page = (i - 1) * 28
        url = 'https://www.donedeal.ie/farming?sort=publishdate%20desc&start={}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.content
        soup = BeautifulSoup(plain_text, "html.parser")
        for title in soup("p", {"class": "card__body-title"}):
            x = title.text
            print(x)

donedeal(1)

Page offsets go 0, 28, 56, and so on, so I had to compute the start offset accordingly at the top of the function.
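For reference, that offset arithmetic can be checked on its own (`per_page=28` is just the listing count observed above):

```python
def start_offset(page_number, per_page=28):
    """Map a 1-based page number to DoneDeal's 0-based `start` offset."""
    return (page_number - 1) * per_page

# Pages 1, 2, 3 -> offsets 0, 28, 56
print([start_offset(i) for i in range(1, 4)])
```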

The issue is that nothing is ever printed and the script finishes with exit code 0. Thanks in advance.

Edit 2: I'm trying to scrape from `<p class="card__body-title">Angus calves</p>`.

Upvotes: 2

Views: 1022

Answers (2)

BeigeBruceWayne

Reputation: 127

You need to specify a different User-Agent in your request to make it look like it's coming from a real browser (e.g. headers={'User-Agent': 'Mozilla/5.0'}). Once you do that, your code works as intended.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def donedeal(max_pages):
    for i in range(1, max_pages+1):
        page = (i - 1) * 28
        req = Request('https://www.donedeal.ie/farming?sort=publishdate%20desc&start={}'.format(page), headers={'User-Agent': 'Mozilla/5.0'})
        plain_text = urlopen(req).read()
        plain_text = plain_text.decode('utf-8')  # decode() returns a new str, so the result must be assigned
        soup = BeautifulSoup(plain_text, "html.parser")
        for title in soup("p", {"class": "card__body-title"}):
            x = title.text
            print(x)

donedeal(1)

Upvotes: 3

Christopher Apple

Reputation: 401

When inspecting the soup in pdb (break point before your for loop) I found:

(Pdb++) p soup
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n\n<html><head>\n<title>410 
Gone</title>\n</head><body>\n<h1>Gone</h1>\n<p>The requested 
resource<br/>/farming<br/>\nis no longer available on this server and there is 
no forwarding address.\nPlease remove all references to this resource.
</p>\n</body></html>\n

This probably means some anti-scraping measure is in place! The site detected that you're trying to scrape with Python and served you a page with no data.

In the future, I recommend using pdb to inspect the code, or simply printing out the soup when you run into an issue. This can help clarify what happened and show you which tags are actually available.
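For example, a quick heuristic check (just a sketch; it keys off the "410 Gone" markup shown in the pdb output above) can flag the block page before you try to parse listings:

```python
def looks_blocked(body: bytes) -> bool:
    """Heuristic: DoneDeal's anti-scraping response is a '410 Gone' page."""
    return b'410' in body and b'Gone' in body

# Simulated server reply, mirroring the pdb output above
sample = b'<html><head><title>410 Gone</title></head><body><h1>Gone</h1></body></html>'
print(looks_blocked(sample))  # True -> you were served the block page
```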

EDIT:

Although I wouldn't necessarily recommend it (scraping is against donedeal's terms of service) there is a way to get around this.

If you feel like living on the wild side, you can make the requests module HTTP request look like it's coming from a real user, not a script. You can do this using the following:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

def donedeal(max_pages):
    for i in range(1, max_pages+1):
        page = (i - 1) * 28
        url = 'https://www.donedeal.ie/farming?sort=publishdate%20desc&start={}'.format(page)
        source_code = requests.get(url, headers=headers)
        plain_text = source_code.content
        soup = BeautifulSoup(plain_text, "html.parser")
        for title in soup("p", {"class": "card__body-title"}):
            x = title.text
            print(x)

donedeal(1)

All I did was tell the requests module to use the headers provided in headers. This makes the request look like it's coming from Chrome on a Mac.

I tested this and it printed out the titles you want, with no 410 error! :)

See this answer for more

Upvotes: 2
