Reputation: 95
I'm currently creating a custom webcrawler with Scrapy and am trying to index the fetched content with Elasticsearch. It works fine so far, but I'm only capable of adding content to the search index in the order the crawler filters html tags. So for example with
sel.xpath("//div[@class='article']/h2//text()").extract()
I can get all the content from all h2 tags inside a div with the class "article", so far so good. The next elements that end up in the index are then those from all h3 tags:
sel.xpath("//div[@class='article']/h3//text()").extract()
But the problem here is that the entire order of the text on a site gets messed up like that: all h2 headlines get indexed first, and only afterwards do the h3 headings below them get their turn, which is kind of fatal for a search index. Does anyone have a tip on how to properly get all the content from a page in the right order? (doesn't have to be xpath, just with Scrapy)
Upvotes: 0
Views: 599
Reputation: 14731
I guess you could solve the issue with something like this:
# Select multiple target nodes at once; an XPath union ('|')
# returns its matches in document order
sel_raw = '|'.join([
    "//div[@class='article']/h2",
    "//div[@class='article']/h3",
    # Whatever else you want to select here
])
for node in sel.xpath(sel_raw):
    # Extract the texts for later use
    texts = node.xpath('self::*//text()').extract()
    if node.xpath('self::h2'):
        # An h2 element. Do something with texts
        pass
    elif node.xpath('self::h3'):
        # An h3 element. Do something with texts
        pass
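If it helps to see the document-order idea in isolation, here is a minimal standard-library sketch of the same principle, without Scrapy: walk the page once, top to bottom, and collect every heading as you meet it. The `ArticleHeadingParser` class and the sample markup are made up for illustration; in Scrapy the XPath union above hands you the same ordered sequence directly.

```python
from html.parser import HTMLParser

class ArticleHeadingParser(HTMLParser):
    """Collects (tag, text) pairs for h2/h3 elements inside
    <div class="article">, in document order."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # True for each open div with class="article"
        self.current = None   # heading tag currently being read, or None
        self.buffer = []      # text fragments of the current heading
        self.headings = []    # ordered (tag, text) results

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_stack.append(dict(attrs).get("class") == "article")
        elif any(self.div_stack) and tag in ("h2", "h3"):
            self.current = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == self.current:
            self.headings.append((tag, "".join(self.buffer).strip()))
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

html = """
<div class="article">
  <h2>First section</h2>
  <h3>Subsection A</h3>
  <h2>Second section</h2>
  <h3>Subsection B</h3>
</div>
"""

parser = ArticleHeadingParser()
parser.feed(html)
print(parser.headings)
# -> [('h2', 'First section'), ('h3', 'Subsection A'),
#     ('h2', 'Second section'), ('h3', 'Subsection B')]
```

The h2 and h3 headings come out interleaved exactly as they appear on the page, which is the ordering you'd feed into Elasticsearch.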
Upvotes: 1