Reputation: 958
I have a simple LinkExtractor
rule for a given domain. Something like this: Rule(LinkExtractor(allow=(r'domain\.com/.+/\d+', )), callback='parse_page'),
What I would like, and can't figure out, is how to know at which position a link appeared in the page.
For example, if a given page has 5 links that match my rule, I need to know their order in the HTML, from top to bottom.
I found many questions about the order of extraction, but nothing (unless I misunderstood something) about the order of the links themselves in the HTML.
Upvotes: 0
Views: 43
Reputation: 21446
Scrapy uses lxml for HTML parsing. LinkExtractor
uses root.iter()
to iterate through the document. This line, to be more exact.
Elements provide a tree iterator for this purpose. It yields elements in document order, i.e. in the order their tags would appear if you serialised the tree to XML:
So for this source:
<root>
<child>Child 1</child>
<child>Child 2</child>
<another>Child 3</another>
</root>
it would yield:
>>> from lxml import etree
>>> root = etree.XML(
...     "<root><child>Child 1</child><child>Child 2</child><another>Child 3</another></root>")
>>> for element in root.iter(tag=etree.Element):
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
You can replicate the process using the examples provided in the lxml docs link posted above.
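As a sketch of what that replication could look like: the helper below (hypothetical, not part of Scrapy) parses the raw HTML with lxml's HTMLParser, walks the &lt;a&gt; elements in document order via iter(), and numbers the hrefs that match the same regex used in the rule. The LINK_RE pattern and the sample HTML are assumptions for illustration.

```python
import re
from lxml import etree

# Same pattern as in the LinkExtractor rule from the question.
LINK_RE = re.compile(r'domain\.com/.+/\d+')

def links_with_positions(html):
    """Return (position, href) pairs for matching links, in document order."""
    root = etree.fromstring(html, parser=etree.HTMLParser())
    # root.iter('a') yields <a> elements top-to-bottom, like Scrapy's walk.
    matches = [a.get('href') for a in root.iter('a')
               if a.get('href') and LINK_RE.search(a.get('href'))]
    return list(enumerate(matches))

html = b"""<html><body>
<a href="http://domain.com/a/1">first</a>
<a href="http://other.com/x">not matched</a>
<a href="http://domain.com/b/2">second</a>
</body></html>"""
print(links_with_positions(html))
# [(0, 'http://domain.com/a/1'), (1, 'http://domain.com/b/2')]
```

In a Scrapy callback you could run the same function on response.body to recover each matched link's top-to-bottom position.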
Upvotes: 1