yvan

Reputation: 958

Order of the link in the page with Scrapy

I have a simple LinkExtractor rule for a given domain, something like this: Rule(LinkExtractor(allow=(r'domain\.com/.+/\d+', )), callback='parse_page'),

What I want, and can't figure out, is how to know the position of each link in the page.

For example, if a page has 5 links that match my rule, I need to know their order in the HTML, from top to bottom.

I found many questions about the order of extraction, but nothing (unless I misunderstood something) about the order of the links themselves in the HTML.

Upvotes: 0

Views: 43

Answers (1)

Granitosaurus

Reputation: 21446

Scrapy uses lxml for HTML parsing, and LinkExtractor uses root.iter() to walk the parsed tree.

The lxml docs say:

Elements provide a tree iterator for this purpose. It yields elements in document order, i.e. in the order their tags would appear if you serialised the tree to XML:

So for this source:

<root>
  <child>Child 1</child>
  <child>Child 2</child>
  <another>Child 3</another>
</root>

it would yield:

>>> for element in root.iter(tag=etree.Element):
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3

You can replicate this yourself with the tree-iteration examples in the lxml docs.

Upvotes: 1
