user1592380

Reputation: 36327

Scrapy Xpath with text() contains

I'm using Scrapy, and I'm trying to find a span that contains specific text. I have:

response.selector.xpath('//*[@class="ParamText"]/span/node()')

which returns:

[<Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>,
 <Selector xpath='//*[@class="ParamText"]/span/text()' data=u'C'>,
 <Selector xpath='//*[@class="ParamText"]/span/text()' data=u'  MILES STODOLINK'>]

However, when I run:

>>> response.selector.xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]')
Out[11]: []

Why does the contains function not work?

Upvotes: 9

Views: 19192

Answers (3)

sarc360

Reputation: 349

I use Scrapy with BeautifulSoup 4. In my opinion, BeautifulSoup is easy to read and understand, so it is an option if you don't have to use HtmlXPathSelector. Below is an example that finds all links; you can replace 'a' with 'span'. Hope this helps!

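import scrapy
from bs4 import BeautifulSoup


class LinkItem(scrapy.Item):
    # Hypothetical item with one field; stands in for the undefined Item import.
    url = scrapy.Field()


def parse(self, response):
    # Spider callback: parse the page with BeautifulSoup instead of Scrapy selectors.
    soup = BeautifulSoup(response.body, 'html.parser')
    print('Current url: %s' % response.url)
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is not None:
            url = response.urljoin(href)
            # Create a fresh item per link so earlier yields are not overwritten.
            item = LinkItem()
            item['url'] = url
            yield scrapy.Request(url, callback=self.parse)
            yield item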

Upvotes: 2

Carl H

Reputation: 1036

In my terminal your code works (assuming my example matches your file):

Input:

import scrapy
example='<div class="ParamText"><span>STODOLINK</span></div>'
scrapy.Selector(text=example).xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]').extract()

Output:

['<span>STODOLINK</span>']

Can you clarify what might be different?

Upvotes: 2

har07

Reputation: 89325

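contains() cannot evaluate multiple nodes at once. When its first argument is a node-set such as text(), XPath 1.0 converts it to a string using only the first node in document order: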

/span[contains(text(),"STODOLINK")]

So if the span has multiple text nodes and "STODOLINK" isn't in the span's first text node child, contains() in the expression above won't match. Apply the contains() check to each text node individually, as follows:

//*[@class="ParamText"]/span[text()[contains(.,"STODOLINK")]]

Or, if "STODOLINK" isn't necessarily located directly within the span (it can be nested within another element inside the span), you can simply use . instead of text():

//*[@class="ParamText"]/span[contains(.,"STODOLINK")]

Upvotes: 16
