Vikrant Korde

Reputation: 419

Scrapy xpath giving all matching elements

I have an HTML file from which I want to extract the anchor href values under a specific element (the nav inside the div). The HTML file looks like this:

<html>
<head>
    <title>Test page Vikrant </title>
</head>
<body>
    <div class="mainContainer">
        <a href="https://india.net" class="logoShape">India</a>
        <nav id="vik1">
            <a href="https://aarushmay.com" class="closemobilemenu">home</a>
            <ul class="mainNav">
                <li class="hide-submenu">
                    <a class="comingsoon1" href="https://aarushmay.com/fashion">Fashion </a>
                </li>
            </ul>
        </nav>
        <a href="https://maharashtra.net" class="logoShape">Maharashtra</a>
    </div>
</body>
</html>

The spider code is below:

import os
import scrapy
from scrapy import Selector
class QuotesSpider(scrapy.Spider):
  name = "test"
  localfile_folder="localfiles"
  def start_requests(self):
    testFile = f'{self.localfile_folder}/t1.html'
    absoluteFileName = os.path.abspath(testFile)
    yield scrapy.Request(url=f'file:.///{absoluteFileName}', callback=self.parse)
  def parse(self, response):
    hrefElements = response.xpath('//nav[@id="vik1"]').xpath('//a/@href').getall()
    self.log(f'total records = {len(hrefElements)}')

The output I am getting is 4 anchor elements, whereas I am expecting 2. So I used "Selector", stored the extracted element in it, and then tried to extract the values of the anchor elements. That worked fine.

import os
import scrapy
from scrapy import Selector
class QuotesSpider(scrapy.Spider):
  name = "test"
  localfile_folder="localfiles"
  def start_requests(self):
    testFile = f'{self.localfile_folder}/t1.html'
    absoluteFileName = os.path.abspath(testFile)
    yield scrapy.Request(url=f'file:.///{absoluteFileName}', callback=self.parse)
  def parse(self, response):
    listingDataSel = response.xpath('//nav[@id="vik1"]')
    exactElement = Selector(text=listingDataSel.get())
    hrefElements = exactElement.xpath('//a/@href').getall()
    self.log(f'total records = {len(hrefElements)}')

My question is: why do I need an intermediate Selector variable to store the extracted element?

Upvotes: 0

Views: 754

Answers (3)

Neha Setia Nagpal

Reputation: 589

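You can also use CSS selectors to extract the elements.

  • Scrapy translates them to XPath internally (via the cssselect library), so they are not faster than XPath, but they are often shorter.
  • Many people find them easier to learn and implement.
  • The code often looks cleaner too.
response.css('nav#vik1 a::attr(href)').getall()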

This will give you the href values you are looking for.

Also, per the W3C standard, CSS selectors cannot select text nodes or attribute values. Scrapy selectors therefore provide a few non-standard extensions to CSS selectors that can be quite useful:

  • to select text nodes, use ::text

  • to select attribute values, use ::attr(name) where name is the name of the attribute that you want the value of.

Upvotes: 2

Brenda S.

Reputation: 146

When you did:

exactElement = Selector(text=listingDataSel.get())

you created a new Selector that contains only what listingDataSel.get() returned, wrapped in a new document like this:

<html>
  <body>
    <nav id="vik1">                    
      <a href="https://aarushmay.com" class="closemobilemenu">home
      </a>            
      <ul class="mainNav">                    
        <li class="hide-submenu">                        
          <a class="comingsoon1" href="https://aarushmay.com/fashion">Fashion 
          </a>                
        </li>            
      </ul>        
    </nav>
  </body>
</html>

Because you passed the text parameter, a brand-new HTML document was built from that fragment, which is why you get just two anchor elements. You can check some examples at this link.

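In your first code you got 4 anchor elements because an XPath that starts with // is evaluated from the root of the original document, even when it is chained onto the nav selector. Prefixing it with a dot, as in .//a/@href, makes it relative to the nav. You can also put everything into one expression: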

response.xpath('//div/nav[@id="vik1"]//a/@href').extract()

which gives the same two href values.

Upvotes: 1

Geomario

Reputation: 212

Did you already try targeting the div by its class name? For example, this gets the text of the anchor elements that are direct children of the div in your HTML:

response.xpath('//div[@class = "mainContainer"]/a/text()').extract()

From there, you just target the href attribute instead and you have them.

Check the documentation here

Upvotes: 0
