Arefe

Reputation: 1059

How to use BeautifulSoup to scrape a webpage url

I'm getting started with web scraping and I would like to get the URLs from the page provided below.

import requests
from bs4 import BeautifulSoup as Soup

page = "http://www.zillow.com/homes/for_sale/fore_lt/2-_beds/any_days/globalrelevanceex_sort/57.610107,-65.170899,15.707662,-128.452149_rect/3_zm/"    

response = requests.get(page)
soup = Soup(response.text, "html.parser")

Now I have all the info of the page in soup, and I would like to get the URLs of all the homes shown in the image:

[screenshot of the listings]

When I INSPECT any of the home previews, Chrome opens this DOM element:

[screenshot of the DOM element in Chrome DevTools]

How would I get the link inside the <a href=""> tag using the soup? I think the parent is <div id="lis-results">, but I need a way to navigate to the element. Actually, I need all 391,479 URLs in a text file.
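For the general question of pulling hrefs out of a parsed page, here is a minimal sketch using a hypothetical static HTML fragment that mimics the structure in the screenshot (the element names and paths are made up for illustration; the live Zillow page may not expose the listings in the initial HTML):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, modeled on the DOM shown in the screenshot
html = """
<div id="lis-results">
  <article><a href="/homedetails/123-Main-St/">123 Main St</a></article>
  <article><a href="/homedetails/456-Oak-Ave/">456 Oak Ave</a></article>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Navigate to the parent container first, then collect every anchor under it
results = soup.find("div", id="lis-results")
links = [a["href"] for a in results.find_all("a", href=True)]
print(links)  # ['/homedetails/123-Main-St/', '/homedetails/456-Oak-Ave/']
```

The same `find` / `find_all` pattern works on any container element once you know its id or class.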

Zillow has an API and also a Python wrapper for the convenience of this kind of data job, and I'm looking at the code now. All I need to get is the URLs for the FOR SALE -> Foreclosures and POTENTIAL LISTING -> Foreclosed and Pre-foreclosed information.

Upvotes: 1

Views: 1804

Answers (1)

DdD

Reputation: 612

The issue is that the response to the request you send doesn't contain the URLs. In fact, if I look at the response (using e.g. Jupyter), the important part is missing.

I would suggest a different strategy: these kinds of websites often deliver their data via JSON files.

From the Network tab of the Web Developer tools in Firefox you can find the URL to request the JSON file:

Firefox Network Tab

Now, with this file you can get all the information you need.

import json
import requests
from bs4 import BeautifulSoup as Soup

page = "http://www.zillow.com/search/GetResults.htm?spt=homes&status=110001&lt=001000&ht=111111&pr=,&mp=,&bd=2%2C&ba=0%2C&sf=,&lot=,&yr=,&pho=0&pets=0&parking=0&laundry=0&income-restricted=0&pnd=0&red=0&zso=0&days=any&ds=all&pmf=1&pf=1&zoom=3&rect=-134340820,16594081,-56469727,54952386&p=1&sort=globalrelevanceex&search=maplist&disp=1&listright=true&isMapSearch=true&zoom=3"
response = requests.get(page)                # request the JSON file
json_response = json.loads(response.text)    # parse the JSON
soup = Soup(json_response['list']['listHTML'], 'html.parser')

and the soup has what you are looking for. If you explore the JSON, you will find a lot of useful information. The list of all the URLs can be found with

links = [i.attrs['href'] for i in soup.findAll("a", {"class": "hdp-link"})]

Each URL appears twice. If you want them to be unique, you can deduplicate the list, or otherwise match the full class "hdp-link routable" above. But I always prefer more rather than less!
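Deduplicating and writing the URLs to a text file (as the question asks) can be sketched like this, using a hypothetical `links` list standing in for the selector's output:

```python
# Hypothetical list with duplicates, as the "hdp-link" selector returns
links = ["/homedetails/a/", "/homedetails/b/",
         "/homedetails/a/", "/homedetails/b/"]

# dict.fromkeys drops duplicates while preserving first-seen order
unique_links = list(dict.fromkeys(links))
print(unique_links)  # ['/homedetails/a/', '/homedetails/b/']

# One URL per line in a text file
with open("urls.txt", "w") as f:
    f.write("\n".join(unique_links))
```

Note that a single GetResults request only returns one page of results; to collect all the listings you would loop over the `p=` (page) parameter in the URL.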

Upvotes: 3
