Reputation: 22440
I've created a crawler that parses certain content from a website.
First, it scrapes the category links from the left-hand sidebar.
Second, it harvests all the links spread across the pagination that lead to the profile pages.
Finally, it visits each profile page and scrapes the name, phone and web address.
So far it is working well. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first. I suppose there might be some way to get around this. Here is the complete code I am trying:
import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def category_links(mainurl):
    req = requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)  # links to the categories from the left-hand sidebar

def next_pagelink(process_links):
    req = requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)  # all the pagination links leading to the profile pages

def profile_pagelink(procured_links):
    req = requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)  # profile page of each listing

def target_pagelink(main_links):
    req = requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

category_links(url)
Upvotes: 0
Views: 103
Reputation: 15376
The problem with the first page is that it doesn't have a 'pagination' class, so this expression: tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list, and the profile_pagelink function never gets executed.
As a quick fix you can handle this case separately in the category_links function:
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
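A more general alternative to hard-coding that one URL is to fall back to the current page whenever the pagination query comes up empty. This is a sketch of the idea, not part of the original answer; pick_pages is a made-up helper name, and the inline HTML snippets are stand-ins for real Houzz pages:

from lxml import html

def pick_pages(page_url, tree):
    # Hypothetical helper: return the pagination links, or the page
    # itself when the pagination block is missing (e.g. the first page).
    links = tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href")
    return links or [page_url]

# A first page with no pagination block at all:
first = html.fromstring(
    "<html><body><div class='name-info'>profiles here</div></body></html>"
)
print(pick_pages("https://www.houzz.com/professionals/", first))

# A later page that does paginate:
later = html.fromstring(
    "<html><body><ul class='pagination'>"
    "<li><a class='pageNumber' href='/p/2'>2</a></li>"
    "<li><a class='pageNumber' href='/p/3'>3</a></li>"
    "</ul></body></html>"
)
print(pick_pages("/p/1", later))

With a helper like this, next_pagelink can iterate over pick_pages(...) and the first page is handled the same way as every other page, with no special case in category_links.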
Also, I noticed that target_pagelink prints a lot of empty strings as a result of if_exist returning "". You can skip those cases by adding a condition in the for loop:
for titles in tree.xpath("//div[@class='container']"):  # use class='profile-cover' if you get duplicates
    name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
    if name + phone + web:
        print(name, phone, web)
Finally, requests.Session is mostly used for persisting cookies and other headers across requests, which isn't necessary for your script. You can just use requests.get and get the same results.
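The repeated Session boilerplate in each function then collapses into one fetch-and-parse helper. A minimal sketch of what that could look like (fetch_tree and FakeResponse are made-up names for illustration; the HTTP call is injectable so the snippet can be exercised without touching the network):

import requests
from lxml import html

def fetch_tree(url, getter=requests.get):
    """Hypothetical helper: fetch a page and return a parsed lxml tree.

    Plain requests.get suffices -- no Session is needed when the crawler
    relies on no cookies or shared headers. `getter` is injectable so
    the sketch can run offline.
    """
    return html.fromstring(getter(url).text)

class FakeResponse:
    """Stands in for a requests.Response in this offline demo."""
    text = "<html><body><a class='pro-title' href='/pro/1'>Pro</a></body></html>"

tree = fetch_tree("https://www.houzz.com/professionals/", getter=lambda url: FakeResponse())
print(tree.xpath("//a[@class='pro-title']/@href"))

Each of your four functions could then start with tree = fetch_tree(some_url) instead of building a fresh Session every time.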
Upvotes: 1