Reputation: 549
I am working on a Scrapy spider that uses XPath to extract the information I need. The source page is generated by the website's search function. For example, I want the items with "computer" in the title. On the source page, every occurrence of "computer" is in bold because of the search. "Computer" can appear at the beginning, in the middle, or at the end of a title, and some items don't have "computer" in the title at all. See the examples below:
Example 1: ("computer" at the beginning)
<a class="title" href="whatever1">
<strong> Computer </strong>
, used
</a>
Example 2: ("computer" in the middle)
<a class="title" href="whatever2">
Low price
<strong> computer </strong>
, great deal
</a>
Example 3: ("computer" at the end)
<a class="title" href="whatever3">
Don't miss this
<strong> Computer </strong>
</a>
Example 4: (no keyword of "computer")
<a class="title" href="whatever4">
Best laptop deal ever!
</a>
The XPath I tried, .//a[@class="title"]/text(), only returns the portion AFTER the strong element. For the above 4 examples, I get the following results:
Example 1:
, used
Example 2:
, great deal
Example 3: (Nothing)
Example 4:
Best laptop deal ever!
I need an XPath expression that covers all four situations and collects the full title of each item.
Upvotes: 4
Views: 1957
Reputation: 473763
The simplest approach would be to search for all "text" nodes and "join" them:
"".join(response.xpath('.//a[@class="title"]//text()').extract())
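Note the double slash before text(): it is the key fix here. A single slash selects only the text nodes that are direct children of the a element, while the double slash selects every descendant text node, including the one inside strong.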
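Two practical details are worth a sketch: the join should be done per link (otherwise the titles of all items run together), and the joined text still contains the line breaks and padding from the markup. The example below is only an illustration using the standard library, not Scrapy itself: `itertext()` from `xml.etree.ElementTree` plays the role of `//text()`, the `<div>` wrapper around the question's four examples is an assumption, and `full_title` is a hypothetical helper name. In a spider, the analogous step would presumably be to loop over `response.xpath('//a[@class="title"]')` and call `"".join(link.xpath('.//text()').extract())` on each link before normalizing the whitespace the same way.

```python
import xml.etree.ElementTree as ET

# The four examples from the question, wrapped in a <div> (the wrapper is
# an assumption so that the snippet parses as a single XML document).
SAMPLE = """<div>
<a class="title" href="whatever1">
<strong> Computer </strong>
, used
</a>
<a class="title" href="whatever2">
Low price
<strong> computer </strong>
, great deal
</a>
<a class="title" href="whatever3">
Don't miss this
<strong> Computer </strong>
</a>
<a class="title" href="whatever4">
Best laptop deal ever!
</a>
</div>"""


def full_title(link):
    """Join every descendant text node of `link` and collapse whitespace."""
    # itertext() yields text inside and outside <strong>, in document order.
    raw = "".join(link.itertext())
    # split() with no arguments drops newlines and repeated spaces.
    return " ".join(raw.split())


root = ET.fromstring(SAMPLE)
# One title per link, instead of one string for the whole page.
titles = [full_title(a) for a in root.findall(".//a[@class='title']")]

for title in titles:
    print(title)
```

With the markup as written in the question, a space remains before the comma (for example "Computer , used") because the comma starts on its own line in the source; whether to strip it depends on how the titles are used.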
Upvotes: 6