SIM

Reputation: 22440

Crawler skipping content of the first page

I've created a crawler which is parsing certain content from a website.

First, it scrapes the category links from the left sidebar.

Second, it collects all the profile-page links spread across the pagination.

Finally, it visits each profile page and scrapes the name, phone and web address.

So far, it is doing well. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first page. I suppose there must be some way to work around this. Here is the complete code I am trying:

import requests
from lxml import html

url="https://www.houzz.com/professionals/"

def category_links(mainurl):
    req=requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)   # links to the category from left-sided bar


def next_pagelink(process_links):
    req=requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)      # the whole links spread through pagination connected to the profile page


def profile_pagelink(procured_links):
    req=requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)         # profile page of each link


def target_pagelink(main_links):
    req=requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles,xpath):
        info=titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles,".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles,".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles,".//a[@class='proWebsiteLink']/@href")
        print(name,phone,web)

category_links(url)

Upvotes: 0

Views: 103

Answers (1)

t.m.adam

Reputation: 15376

The problem with the first page is that it has no element with the 'pagination' class, so the expression tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function never gets called for it.
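The behaviour is easy to reproduce offline. With a stand-in snippet (this markup is illustrative, not Houzz's actual HTML), the XPath yields an empty list, and a for loop over an empty list simply never runs:

```python
from lxml import html

# A page with no <ul class="pagination"> element, mimicking the
# landing page the crawler starts from.
first_page = html.fromstring(
    "<html><body><div class='name-info'></div></body></html>")

links = first_page.xpath(
    "//ul[@class='pagination']//a[@class='pageNumber']/@href")
print(links)  # [] -- so the loop body, and profile_pagelink, never run
```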

As a quick fix, you can handle this case separately in the category_links function:

def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/": 
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)   
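A more general variant (a sketch, not tested against the live site; the `scrape_profile` and `get_text` parameters are illustrative additions, not part of the original code) is to scrape every page that next_pagelink visits before following its pagination links, so no URL needs to be hardcoded. Note it would double-scrape a page if the pagination bar also links back to the current page:

```python
import requests
from lxml import html

def next_pagelink(process_links, scrape_profile,
                  get_text=lambda url: requests.get(url).text):
    tree = html.fromstring(get_text(process_links))
    # Scrape the current page first -- this is exactly the page the
    # original code skipped -- then follow the pagination links,
    # which cover pages 2 and up.
    scrape_profile(process_links)
    for link in tree.xpath(
            "//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        scrape_profile(link)
```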

Also, I noticed that target_pagelink prints a lot of empty lines when if_exist returns "" for every field. You can skip those cases by adding a condition in the for loop:

for titles in tree.xpath("//div[@class='container']"):    # use class='profile-cover' if you get duplicates #
    name = if_exist(titles,".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles,".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles,".//a[@class='proWebsiteLink']/@href")
    if name+phone+web : 
        print(name,phone,web)
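The guard works through string truthiness: the concatenation name + phone + web is an empty (falsy) string only when all three fields are empty, so only the all-empty rows get skipped. A quick illustration with made-up values:

```python
rows = [("SIM Design", "555-0100", "http://example.com"),  # made-up data
        ("", "", "")]  # a container div where nothing matched

for name, phone, web in rows:
    if name + phone + web:       # falsy only when every field is ""
        print(name, phone, web)  # the all-empty row is not printed
```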

Finally, requests.Session is mostly used for persisting cookies and other headers across requests, which isn't necessary for your script. You can just use requests.get and get the same results.
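For completeness: a plain requests.get call spins up a throwaway Session internally anyway, which is effectively what creating a new Session inside every function amounts to. If you did want the benefit a Session offers, reusing the TCP connection across requests to the same host, the pattern would be one shared instance (a sketch, assuming the rest of the script is unchanged):

```python
import requests

# One shared Session lets requests reuse the underlying TCP
# connection across calls to the same host; a fresh Session per
# function call, as in the original code, gives neither that benefit
# nor any other.
session = requests.Session()

def get_page(url):
    return session.get(url).text
```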

Upvotes: 1
