Reputation: 419
I have an HTML file from which I want to extract the anchor href values under a specific div. The HTML file looks like this:
<html>
  <head>
    <title>Test page Vikrant </title>
  </head>
  <body>
    <div class="mainContainer">
      <a href="https://india.net" class="logoShape">India</a>
      <nav id="vik1">
        <a href="https://aarushmay.com" class="closemobilemenu">home</a>
        <ul class="mainNav">
          <li class="hide-submenu">
            <a class="comingsoon1" href="https://aarushmay.com/fashion">Fashion </a>
          </li>
        </ul>
      </nav>
      <a href="https://maharashtra.net" class="logoShape">Maharashtra</a>
    </div>
  </body>
The spider code is as follows:
import os
import scrapy
from scrapy import Selector
class QuotesSpider(scrapy.Spider):
    name = "test"
    localfile_folder="localfiles"
    def start_requests(self):
        testFile = f'{self.localfile_folder}/t1.html'
        absoluteFileName = os.path.abspath(testFile)
        yield scrapy.Request(url=f'file:.///{absoluteFileName}', callback=self.parse)
    def parse(self, response):
        hrefElements = response.xpath('//nav[@id="vik1"]').xpath('//a/@href').getall()
        self.log(f'total records = {len(hrefElements)}')
The output I am getting is 4 anchor elements, whereas I am expecting 2. So I used "Selector", stored the div element in it, and then extracted the values of the anchor elements. That worked fine.
import os
import scrapy
from scrapy import Selector
class QuotesSpider(scrapy.Spider):
    name = "test"
    localfile_folder="localfiles"
    def start_requests(self):
        testFile = f'{self.localfile_folder}/t1.html'
        absoluteFileName = os.path.abspath(testFile)
        yield scrapy.Request(url=f'file:.///{absoluteFileName}', callback=self.parse)
    def parse(self, response):
        listingDataSel = response.xpath('//nav[@id="vik1"]')
        exactElement = Selector(text=listingDataSel.get())
        hrefElements = exactElement.xpath('//a/@href').getall()
        self.log(f'total records = {len(hrefElements)}')
My question is: why do I need to use an intermediate Selector variable to store the extracted div element?
Upvotes: 0
Views: 754
Reputation: 589
You can also use CSS Selectors to extract the elements.
response.css('nav[id = "vik1"] a::attr(href)').getall()
This will give you the href values you are looking for.
Also, as per W3C standards, CSS selectors do not support selecting text nodes or attribute values. Scrapy selectors provide some extensions to CSS selectors that can be quite useful:
to select text nodes, use ::text
to select attribute values, use ::attr(name), where name is the name of the attribute whose value you want.
Upvotes: 2
Reputation: 146
When you wrote:
exactElement = Selector(text=listingDataSel.get())
you created a Selector that contains only what you extracted with listingDataSel.get(), wrapped in a new document as follows:
<html>
  <body>
    <nav id="vik1">
      <a href="https://aarushmay.com" class="closemobilemenu">home</a>
      <ul class="mainNav">
        <li class="hide-submenu">
          <a class="comingsoon1" href="https://aarushmay.com/fashion">Fashion</a>
        </li>
      </ul>
    </nav>
  </body>
</html>
When you use the text parameter, you create a new HTML document, which is why you get just two anchor elements. You can check some examples at this link.
In your first code, you got 4 anchor elements because an XPath that starts with // is absolute: even when chained after the nav selector, //a searches the whole original document. You can try this too:
response.xpath('//div/nav[@id="vik1"]//a/@href').extract()
and you will get the two hrefs inside the nav.
Upvotes: 1
Reputation: 212
Did you already try targeting the div by its class name? For example, to get the text of the anchor elements in your HTML:
response.xpath('//div[@class = "mainContainer"]/a/text()').extract()
From there, you just target the href and you have them.
Check the documentation here
Upvotes: 0