Reputation: 99
I'm writing a Python script to automatically check dog re-homing sites for dogs that we might be able to adopt as they become available. However, I'm stuck on submitting the form data on this site and can't figure out why.
The form attributes state it should have a post method and I've gone through all of the inputs for the form and created a payload.
I expect the search results page to be returned so I can scrape its HTML and start processing it, but what I get back is just the form page, never the results.
I've tried using .get with the payload as params, requesting the URL with the payload appended, and using the requests-html library to render any JavaScript elements, all without success.
If you paste url_w_payload into a browser, it loads the page and says one of the fields is empty. If you then press Enter in the URL bar again to reload the page without modifying the URL, it loads. Something to do with cookies, maybe?
import requests
from requests_html import HTMLSession
session = HTMLSession()
form_url = "https://www.rspca.org.uk/findapet?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search"
url_w_payload = "https://www.rspca.org.uk/findapet?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search&noPageView=false&animalType=DOG&freshSearch=false&arrivalSort=false&previousAnimalType=&location=WC2N5DU&previousLocation=&prevSearchedPostcode=&postcode=WC2N5DU&searchedLongitude=-0.1282688&searchedLatitude=51.5072106"
payload = {'noPageView': 'false','animalType': 'DOG', 'freshSearch': 'false', 'arrivalSort': 'false', 'previousAnimalType': '', 'location': 'WC2N5DU', 'previousLocation': '','prevSearchedPostcode': '', 'postcode': 'WC2N5DU', 'searchedLongitude': '-0.1282688', 'searchedLatitude': '51.5072106'}
#req = requests.post(form_url, data = payload)
#with open("requests_output.txt", "w") as f:
#    f.write(req.text)
ses = session.post(form_url, data = payload)
ses.html.render()
with open("session_output.txt", "w") as f:
    f.write(ses.text)
print("Done")
Upvotes: 2
Views: 3930
Reputation: 20042
There are a few hoops to jump through with cookies and headers, but once you get those right, you'll get the proper response.
Here's how to do it:
import time
from urllib.parse import urlencode
import requests
from bs4 import BeautifulSoup
query_string = {
    "p_p_id": "petSearch2016_WAR_ptlPetRehomingPortlets",
    "p_p_lifecycle": 1,
    "p_p_state": "normal",
    "p_p_mode": "view",
    "_petSearch2016_WAR_ptlPetRehomingPortlets_action": "search",
}
payload = {
    'noPageView': 'false',
    'animalType': 'DOG',
    'freshSearch': 'false',
    'arrivalSort': 'false',
    'previousAnimalType': '',
    'location': 'WC2N5DU',
    'previousLocation': '',
    'prevSearchedPostcode': '',
    'postcode': 'WC2N5DU',
    'searchedLongitude': '-0.1282688',
    'searchedLatitude': '51.5072106',
}
def make_cookies(cookie_dict: dict) -> str:
    return "; ".join(f"{k}={v}" for k, v in cookie_dict.items())
with requests.Session() as connection:
    main_url = "https://www.rspca.org.uk"
    connection.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) " \
                                       "AppleWebKit/537.36 (KHTML, like Gecko) " \
                                       "Chrome/90.0.4430.212 Safari/537.36"
    # Visit the main page first to pick up the session cookies.
    r = connection.get(main_url)
    cookies = make_cookies(r.cookies.get_dict())
    # Extra cookies appended by hand to the ones the server returned.
    additional_string = f"; cb-enabled=enabled; " \
                        f"LFR_SESSION_STATE_10110={int(time.time())}"
    post_url = f"https://www.rspca.org.uk/findapet?{urlencode(query_string)}"
    connection.headers.update(
        {
            "cookie": cookies + additional_string,
            "referer": post_url,
            "content-type": "application/x-www-form-urlencoded",
        }
    )
    # Submit the search form as a URL-encoded body.
    response = connection.post(post_url, data=urlencode(payload)).text
    # Each search result is an <a> tag with the class "detailLink".
    dogs = BeautifulSoup(response, "lxml").find_all("a", class_="detailLink")
    print("\n".join(f"{main_url}{dog['href']}" for dog in dogs))
Output (shortened for brevity; there's no need to paginate, as all dogs come back in a single response):
https://www.rspca.org.uk/findapet/details/-/Animal/JAY_JAY/ref/217747/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/STORM/ref/217054/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/DASHER/ref/205702/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/EVE/ref/205701/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/SEBASTIAN/ref/178975/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/FIJI/ref/169578/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/ELLA/ref/154419/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/BEN/ref/217605/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/SNOWY/ref/214416/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/BENSON/ref/215141/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/BELLA/ref/207716/rehome/
and many more ...
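The cookie handling is probably the least obvious part of the code above, so here is a minimal offline sketch of how the `cookie` header string gets put together. The names and values in `server_cookies` are made-up placeholders (the real ones come back from the first GET), and `build_cookie_header` is a hypothetical helper for illustration, not something `requests` provides.

```python
import time


def build_cookie_header(server_cookies: dict, extra_cookies: dict) -> str:
    """Join cookies into a single 'name=value; name=value' header string."""
    merged = {**server_cookies, **extra_cookies}  # later keys win on clashes
    return "; ".join(f"{name}={value}" for name, value in merged.items())


# Placeholder values: assumptions for illustration, not real site cookies.
server_cookies = {"JSESSIONID": "abc123", "GUEST_LANGUAGE_ID": "en_GB"}

# The two extra cookies that the code above appends by hand.
extra_cookies = {
    "cb-enabled": "enabled",
    "LFR_SESSION_STATE_10110": int(time.time()),
}

cookie_header = build_cookie_header(server_cookies, extra_cookies)
print(cookie_header)
```

Merging both dicts before joining avoids a stray leading `; ` if the first request happens to return no cookies, which plain string concatenation would produce.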
PS. I really enjoyed this challenge as I have two dogs from a shelter. Keep it up, man!
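If you'd rather process the results without depending on `lxml`, the link extraction can also be sketched with the standard library alone. This swaps BeautifulSoup for `html.parser.HTMLParser`. The sample HTML is invented to mimic the `detailLink` anchors, so treat its structure as an assumption rather than the real page markup, and `DetailLinkParser` as a hypothetical name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DetailLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class list contains 'detailLink'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if "detailLink" in classes and href:
            self.links.append(href)


# Invented sample markup: an assumption, not copied from the real site.
sample_html = """
<div>
  <a class="detailLink" href="/findapet/details/-/Animal/STORM/ref/217054/rehome/">Storm</a>
  <a class="other" href="/somewhere-else">Ignore me</a>
  <a class="detailLink" href="/findapet/details/-/Animal/BEN/ref/217605/rehome/">Ben</a>
</div>
"""

parser = DetailLinkParser()
parser.feed(sample_html)

# Turn the relative hrefs into absolute URLs.
dog_urls = [urljoin("https://www.rspca.org.uk", href) for href in parser.links]
print("\n".join(dog_urls))
```

In the real script you would feed the text of the POST response to the parser instead of `sample_html`.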
Upvotes: 2