Howard

Reputation: 1

xpath could not recognize predicate for a tag

I am trying to scrape a page with Scrapy's XPath selectors, but a query with a predicate on the name attribute returns nothing when I build it inside a for loop:

# This package will contain the spiders of your Scrapy project

from cunyfirst.items import CunyfirstSectionItem
import scrapy
import json

class CunyfristsectionSpider(scrapy.Spider):
    name = "cunyfirst-section-spider"
    start_urls = ["file:///Users/haowang/Desktop/section.htm"]

    def parse(self, response):
        url = response.url
        yield scrapy.Request(url, self.parse_page)

    def parse_page(self, response):

        n = -1
        for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"):
            print(response.xpath("//a[@name ='MTG_CLASSNAME$10']/text()"))

            n += 1

            class_num = section.xpath('text()').extract_first()
            # print(class_num)
            classname = "MTG_CLASSNAME$" + str(n)
            date = "MTG_DAYTIME$" + str(n)
            instr = "MTG_INSTR$" + str(n)
            print(classname)

            class_name = response.xpath("//a[@name = classname]/text()")

I am looking for <a> tags whose name is "MTG_CLASSNAME$" + str(n), with n being 0, 1, 2, ..., but my XPath query returns empty output. I am not sure why.

PS. I am basically trying to scrape courses and their info from https://hrsa.cunyfirst.cuny.edu/psc/cnyhcprd/GUEST/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?FolderPath=PORTAL_ROOT_OBJECT.HC_CLASS_SEARCH_GBL&IsFolder=false&IgnoreParamTempl=FolderPath%252cIsFolder&PortalActualURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentProvider=HRMS&PortalCRefLabel=Class%20Search&PortalRegistryName=GUEST&PortalServletURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsp%2fcnyepprd%2f&PortalURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsc%2fcnyepprd%2f&PortalHostNode=ENTP&NoCrumbs=yes with these filters applied: Kingsborough CC, Fall 18, BIO

Thanks!

Upvotes: 0

Views: 46

Answers (1)

xpeiro

Reputation: 751

Well... I visited the website linked in the question, inspected the page elements, and searched for "MTG_CLASSNAME": I got 0 matches.

So here are some tools to debug this:

  • In your settings.py, set:

    LOG_FILE = "log.txt"

    LOG_STDOUT = True

    Then print the response body (response.body) at the top of the parse_page function and look for it in log.txt.

  • Check whether the element you are looking for is actually in that output.

  • If it is, use https://www.freeformatter.com/xpath-tester.html (or a similar tool) to check your XPath expression.
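As a rough sketch of the "is it in the response at all?" check from the list above, something like this could be run against the logged body before touching the XPath. The helper name find_anchor_names and the sample HTML are made up for illustration; in the spider you would pass response.text instead of the sample string.

```python
import re


def find_anchor_names(html: str, prefix: str) -> list:
    """Return every name="..." attribute value in html that starts with prefix."""
    pattern = re.compile(
        r'name\s*=\s*["\'](' + re.escape(prefix) + r'[^"\']*)["\']'
    )
    return pattern.findall(html)


# Hypothetical fragment standing in for the real page body.
sample = (
    '<a name="MTG_CLASS_NBR$0">1234</a>'
    '<a name="MTG_CLASSNAME$0">BIO 11</a>'
)

# An empty list here would mean the markup never reached the spider,
# so no XPath expression could match it.
print(find_anchor_names(sample, "MTG_CLASSNAME"))
```

If the list comes back empty for the real body, the problem is in what the server returned (or in the saved file), not in the selector.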

In addition, change for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"): to for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]").extract():. Since .extract() returns plain strings, the section.xpath(...) call inside the loop will raise an error as soon as the loop gets a match, which tells you the data you are looking for is there.
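One more thing that may be worth ruling out: in the last line of the question's code, classname sits inside the quoted XPath string, so Python never substitutes the variable and the query is sent with the literal word classname in it. Below is a minimal sketch of building the expression from the variable instead. It uses the standard library's xml.etree.ElementTree rather than Scrapy so it runs on its own; the sample markup and the helper names build_query and class_names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the real page.
SAMPLE = (
    "<div>"
    '<a name="MTG_CLASSNAME$0">BIO 11</a>'
    '<a name="MTG_CLASSNAME$1">BIO 12</a>'
    "</div>"
)


def build_query(n: int) -> str:
    """Put the value of classname into the expression, not the word itself."""
    classname = "MTG_CLASSNAME$" + str(n)
    return f".//a[@name='{classname}']"


def class_names(xml_text: str, count: int) -> list:
    """Collect the text of each <a> whose name is MTG_CLASSNAME$0..count-1."""
    root = ET.fromstring(xml_text)
    names = []
    for n in range(count):
        for link in root.findall(build_query(n)):
            names.append(link.text)
    return names


print(build_query(0))
print(class_names(SAMPLE, 2))
```

In Scrapy itself the same idea would be response.xpath("//a[@name=$name]/text()", name=classname), since Scrapy selectors accept XPath variables as keyword arguments, or an f-string like the one in build_query.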

Upvotes: 1
