Reputation: 958
I have a simple LinkExtractor
rule for a given domain. Something like this: Rule(LinkExtractor(allow=(r'domain\.com/.+/\d+', )), callback='parse_page'),
What I would like, and can't figure out, is how to know at which position a link appeared in the page.
For example, if a given page has 5 links that match my rule, I need to know their order in the HTML, from top to bottom.
I found many questions about the order of extraction, but nothing (unless I misunderstood something) about the order of the links themselves in the HTML.
Upvotes: 0
Views: 43
Reputation: 21446
Scrapy uses lxml for HTML parsing. LinkExtractor
uses root.iter()
to iterate through the document. This line, to be more exact.
Elements provide a tree iterator for this purpose. It yields elements in document order, i.e. in the order their tags would appear if you serialised the tree to XML:
So for this source:
<root>
<child>Child 1</child>
<child>Child 2</child>
<another>Child 3</another>
</root>
it would yield:
>>> from lxml import etree
>>> root = etree.XML(
...     "<root><child>Child 1</child><child>Child 2</child><another>Child 3</another></root>")
>>> for element in root.iter(tag=etree.Element):
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
You can replicate the process using the examples provided in the lxml docs link posted above.
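As a sketch of what that replication could look like: the helper below (hypothetical, not part of Scrapy) parses the raw HTML with lxml's HTMLParser, walks the &lt;a&gt; elements in document order via iter(), and numbers the hrefs that match the same regex used in the rule. The LINK_RE pattern and the sample HTML are assumptions for illustration.

```python
import re
from lxml import etree

# Same pattern as in the LinkExtractor rule from the question.
LINK_RE = re.compile(r'domain\.com/.+/\d+')

def links_with_positions(html):
    """Return (position, href) pairs for matching links, in document order."""
    root = etree.fromstring(html, parser=etree.HTMLParser())
    # root.iter('a') yields <a> elements top-to-bottom, like Scrapy's walk.
    matches = [a.get('href') for a in root.iter('a')
               if a.get('href') and LINK_RE.search(a.get('href'))]
    return list(enumerate(matches))

html = b"""<html><body>
<a href="http://domain.com/a/1">first</a>
<a href="http://other.com/x">not matched</a>
<a href="http://domain.com/b/2">second</a>
</body></html>"""
print(links_with_positions(html))
# [(0, 'http://domain.com/a/1'), (1, 'http://domain.com/b/2')]
```

In a Scrapy callback you could run the same function on response.body to recover each matched link's top-to-bottom position.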
Upvotes: 1