alienware13user

Reputation: 128

Beautiful Soup <p> parameter

I'm trying to print the title of each item on DoneDeal. I copied the code from my own spider, which works flawlessly on Overclockers, and changed it accordingly:

import requests
from bs4 import BeautifulSoup

def donedeal(max_pages):
    for i in range(1, max_pages+1):
        page = (i - 1) * 28
        url = 'https://www.donedeal.ie/farming?sort=publishdate%20desc&start={}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.content
        soup = BeautifulSoup(plain_text, "html.parser")
        for title in soup("p", {"class": "card__body-title"}):
            x = title.text
            print(x)

donedeal(1)

Page offsets go 0, 28, 56, and so on, so I had to compute the start offset accordingly at the top of the function.
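For reference, that offset arithmetic can be checked on its own (`per_page=28` is just the listing count observed above):

```python
def start_offset(page_number, per_page=28):
    """Map a 1-based page number to DoneDeal's 0-based `start` offset."""
    return (page_number - 1) * per_page

# Pages 1, 2, 3 -> offsets 0, 28, 56
print([start_offset(i) for i in range(1, 4)])
```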

The issue is that nothing is ever printed and the script finishes with exit code 0. Thanks in advance.

Edit 2: I'm trying to scrape from `<p class="card__body-title">Angus calves</p>`.

Upvotes: 2

Views: 1022

Answers (2)

BeigeBruceWayne

Reputation: 127

You need to specify a different User-Agent in your request to make it look like it's coming from a real browser (e.g. headers={'User-Agent': 'Mozilla/5.0'}). Once you do that, your code works as intended.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def donedeal(max_pages):
    for i in range(1, max_pages+1):
        page = (i - 1) * 28
        req = Request('https://www.donedeal.ie/farming?sort=publishdate%20desc&start={}'.format(page), headers={'User-Agent': 'Mozilla/5.0'})
        plain_text = urlopen(req).read()
        plain_text = plain_text.decode('utf-8')  # decode() returns a new str, so the result must be assigned
        soup = BeautifulSoup(plain_text, "html.parser")
        for title in soup("p", {"class": "card__body-title"}):
            x = title.text
            print(x)

donedeal(1)

Upvotes: 3

Christopher Apple

Reputation: 401

When inspecting the soup in pdb (break point before your for loop) I found:

(Pdb++) p soup
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n\n<html><head>\n<title>410 
Gone</title>\n</head><body>\n<h1>Gone</h1>\n<p>The requested 
resource<br/>/farming<br/>\nis no longer available on this server and there is 
no forwarding address.\nPlease remove all references to this resource.
</p>\n</body></html>\n

This probably means some anti-scraping measure is in place! The site detected that you're trying to scrape with Python and served you a page with no data.

In the future, I recommend using pdb to inspect the code, or simply printing out the soup when you run into an issue. This can help clarify what happened and show you which tags are actually available.
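For example, a quick heuristic check (just a sketch; it keys off the "410 Gone" markup shown in the pdb output above) can flag the block page before you try to parse listings:

```python
def looks_blocked(body: bytes) -> bool:
    """Heuristic: DoneDeal's anti-scraping response is a '410 Gone' page."""
    return b'410' in body and b'Gone' in body

# Simulated server reply, mirroring the pdb output above
sample = b'<html><head><title>410 Gone</title></head><body><h1>Gone</h1></body></html>'
print(looks_blocked(sample))  # True -> you were served the block page
```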

EDIT:

Although I wouldn't necessarily recommend it (scraping is against donedeal's terms of service) there is a way to get around this.

If you feel like living on the wild side, you can make the requests module HTTP request look like it's coming from a real user, not a script. You can do this using the following:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

def donedeal(max_pages):
    for i in range(1, max_pages+1):
        page = (i - 1) * 28
        url = 'https://www.donedeal.ie/farming?sort=publishdate%20desc&start={}'.format(page)
        source_code = requests.get(url, headers=headers)
        plain_text = source_code.content
        soup = BeautifulSoup(plain_text, "html.parser")
        for title in soup("p", {"class": "card__body-title"}):
            x = title.text
            print(x)

donedeal(1)

All I did was tell the requests module to use the headers provided in headers. This makes the request look like it's coming from Chrome on a Mac.

I tested this and it printed out the titles you want, with no 410 error! :)

See this answer for more

Upvotes: 2
