Reputation: 1059
I am getting started with web scraping and would like to get the URLs from the page given below.
import requests
from bs4 import BeautifulSoup as Soup
page = "http://www.zillow.com/homes/for_sale/fore_lt/2-_beds/any_days/globalrelevanceex_sort/57.610107,-65.170899,15.707662,-128.452149_rect/3_zm/"
response = requests.get(page)
soup = Soup(response.text, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
Now I have all the content of the page in soup, and I would like to get the URLs of all the homes shown in the image.
When I inspect any of the videos of the home, Chrome opens this DOM element, shown in the image:
How would I get the link inside the <a href=""> tag using the soup? I think the parent is <div id = "lis-results">, but I need a way to navigate to the element. Actually, I need all the URLs (391,479) in a text file.
Zillow has an API and also a Python wrapper that make this kind of data job more convenient, and I'm looking at the code now. All I need is the URLs for the FOR SALE -> Foreclosures and POTENTIAL LISTING -> Foreclosed and Pre-foreclosed information.
Upvotes: 1
Views: 1804
Reputation: 612
The issue is that the response to your request doesn't contain the URLs. In fact, if I look at the response (using e.g. Jupyter), I get:
I would suggest a different strategy: these kinds of websites often load their data as JSON.
From the Network tab of Web Developer in Firefox you can find the URL to request the JSON file:
Now, with this file you can get all the information needed.
import json
import requests
from bs4 import BeautifulSoup as Soup

page = "http://www.zillow.com/search/GetResults.htm?spt=homes&status=110001&lt=001000&ht=111111&pr=,&mp=,&bd=2%2C&ba=0%2C&sf=,&lot=,&yr=,&pho=0&pets=0&parking=0&laundry=0&income-restricted=0&pnd=0&red=0&zso=0&days=any&ds=all&pmf=1&pf=1&zoom=3&rect=-134340820,16594081,-56469727,54952386&p=1&sort=globalrelevanceex&search=maplist&disp=1&listright=true&isMapSearch=true&zoom=3"
response = requests.get(page) # request the json file
json_response = json.loads(response.text) # parse the json file
soup = Soup(json_response['list']['listHTML'], 'html.parser')
and the soup has what you are looking for. If you explore the JSON, you will find a lot of useful information. The list of all the URLs can be found with
links = [i.attrs['href'] for i in soup.findAll("a",{"class":"hdp-link"})]
All the URLs appear twice. If you want them to be unique, you can deduplicate the list or, alternatively, look for "hdp-link routable" in the class filter above.
But I always prefer more than less!
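If you do want the unique URLs in a text file, as the question asks, here is a minimal sketch of one way to do it. It assumes the hrefs in listHTML are site-relative paths (I have not verified this), and the names unique_absolute_links, save_links, sample_links and zillow_links.txt are made up for illustration. The sample list stands in for the links list built above.

```python
from urllib.parse import urljoin

# Assumption: the hrefs are site-relative, so they are joined onto this base.
BASE_URL = "http://www.zillow.com"


def unique_absolute_links(hrefs, base_url=BASE_URL):
    """Drop duplicate hrefs (keeping first-seen order) and make them absolute."""
    unique = dict.fromkeys(hrefs)  # dicts keep insertion order in Python 3.7+
    return [urljoin(base_url, href) for href in unique]


def save_links(urls, path):
    """Write one URL per line to a text file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.writelines(url + "\n" for url in urls)


# Hypothetical sample standing in for the `links` list from the soup.
sample_links = [
    "/homedetails/example-1/",
    "/homedetails/example-1/",
    "/homedetails/example-2/",
]
urls = unique_absolute_links(sample_links)
save_links(urls, "zillow_links.txt")
```

Using dict.fromkeys instead of set keeps the links in the order they appear on the page, and urljoin leaves hrefs that are already absolute untouched.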
Upvotes: 3