Reputation: 128
I'm trying to print out the titles of each item on DoneDeal. I copied the code from my own spider, which works flawlessly on Overclockers, and changed it accordingly:
import requests
from bs4 import BeautifulSoup

def donedeal(max_pages):
    for i in range(1, max_pages + 1):
        page = (i - 1) * 28
        url = 'https://www.donedeal.ie/farming?sort=publishdate%20desc&start={}'.format(page)
        source_code = requests.get(url)
        plain_text = source_code.content
        soup = BeautifulSoup(plain_text, "html.parser")
        for title in soup("p", {"class": "card__body-title"}):
            x = title.text
            print(x)

donedeal(1)
Page numbers go like 0, 28, 56, ..., so I had to make the page number change accordingly at the top of the function.
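For reference, the offset arithmetic at the top of the function can be sketched on its own like this (same formula as in my code, assuming 28 items per page):

```python
# Page i (1-based) maps to start offset (i - 1) * 28,
# matching DoneDeal's 0, 28, 56, ... pagination.
offsets = [(i - 1) * 28 for i in range(1, 4)]
print(offsets)  # [0, 28, 56]
```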
The issue is that nothing is ever printed and I get exit code 0. Thanks in advance.
Edit 2: I'm trying to scrape from <p class="card__body-title">Angus calves</p>.
Upvotes: 2
Views: 1022
Reputation: 127
You need to specify a different User-Agent in your request so that it looks like it comes from a real browser (e.g. headers={'User-Agent': 'Mozilla/5.0'}). Once you do that, your code works as intended.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def donedeal(max_pages):
    for i in range(1, max_pages + 1):
        page = (i - 1) * 28
        req = Request('https://www.donedeal.ie/farming?sort=publishdate%20desc&start={}'.format(page),
                      headers={'User-Agent': 'Mozilla/5.0'})
        # decode() returns a new string rather than modifying in place, so assign it
        plain_text = urlopen(req).read().decode('utf-8')
        soup = BeautifulSoup(plain_text, "html.parser")
        for title in soup("p", {"class": "card__body-title"}):
            x = title.text
            print(x)

donedeal(1)
Upvotes: 3
Reputation: 401
When inspecting the soup in pdb (breakpoint before your for loop), I found:
(Pdb++) p soup
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n\n<html><head>\n<title>410
Gone</title>\n</head><body>\n<h1>Gone</h1>\n<p>The requested
resource<br/>/farming<br/>\nis no longer available on this server and there is
no forwarding address.\nPlease remove all references to this resource.
</p>\n</body></html>\n
This probably means there is an anti-scraping measure in place: the site detected that you're trying to scrape using Python and sent you to a page where you couldn't get any data.
In the future, I recommend using pdb to inspect the code, or simply printing out the soup when you run into an issue. This can help clear up what happened and show you which tags are actually available.
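As a minimal sketch of that kind of sanity check: inspecting the HTTP status before parsing makes a blocked request fail loudly instead of silently printing nothing. (FakeResponse here is a hypothetical stand-in for illustration only; with the real requests library you would pass the actual response object.)

```python
# Hypothetical stand-in for a requests.Response, used only for this sketch.
class FakeResponse:
    status_code = 410
    reason = 'Gone'

def check_response(response):
    """Return a short diagnostic string instead of parsing an error page."""
    if response.status_code != 200:
        return 'blocked? server said {} {}'.format(response.status_code,
                                                   response.reason)
    return 'ok'

print(check_response(FakeResponse()))  # blocked? server said 410 Gone
```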
EDIT:
Although I wouldn't necessarily recommend it (scraping is against DoneDeal's terms of service), there is a way to get around this.
If you feel like living on the wild side, you can make the requests module's HTTP request look like it's coming from a real browser rather than a script. You can do this using the following:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

def donedeal(max_pages):
    for i in range(1, max_pages + 1):
        page = (i - 1) * 28
        url = 'https://www.donedeal.ie/farming?sort=publishdate%20desc&start={}'.format(page)
        source_code = requests.get(url, headers=headers)
        plain_text = source_code.content
        soup = BeautifulSoup(plain_text, "html.parser")
        for title in soup("p", {"class": "card__body-title"}):
            x = title.text
            print(x)

donedeal(1)
All I did was tell the requests module to use the headers provided in headers. This makes the request look like it's coming from Chrome on a Mac.
I tested this and it seemed like it printed out the titles you want, no 410 error! :)
See this answer for more
Upvotes: 2