Reputation: 36327
I'm using Scrapy and trying to select a span that contains specific text. I have:
response.selector.xpath('//*[@class="ParamText"]/span/node()')
which returns:
[<Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>,
<Selector xpath='//*[@class="ParamText"]/span/text()' data=u'C'>,
<Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>]
However, when I run:
>>> response.selector.xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]')
Out[11]: []
Why does the contains function not work?
Upvotes: 9
Views: 19192
Reputation: 349
I use Scrapy with BeautifulSoup 4. In my opinion, BeautifulSoup is easier to read and understand, so it is an option if you don't have to use HtmlXPathSelector. Below is an example that finds all links; you can replace 'a' with 'span'. Hope this helps!
import scrapy
from bs4 import BeautifulSoup

class LinkItem(scrapy.Item):
    # Hypothetical item class; a scrapy.Item must declare its fields.
    url = scrapy.Field()

# Method of a scrapy.Spider subclass.
def parse(self, response):
    soup = BeautifulSoup(response.body, 'html.parser')
    print('Current url: %s' % response.url)
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is not None:
            url = response.urljoin(href)
            # Build one item per link so each yielded item keeps its own URL.
            item = LinkItem()
            item['url'] = url
            yield item
            # Follow the link and parse the target page the same way.
            yield scrapy.Request(url, callback=self.parse)
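Applied to the span case from the question, the same idea might look like the sketch below. The markup is a made-up stand-in for the real page (the actual HTML was not posted), and the check uses get_text(), which joins all text inside the span, so it does not matter which text node holds the word.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the <br> splits the first span's text into two text nodes.
html = '<div class="ParamText"><span>MILES<br>STODOLINK</span><span>C</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# Keep only the spans directly under .ParamText whose combined text contains the word.
matches = [
    span for span in soup.select('.ParamText > span')
    if 'STODOLINK' in span.get_text()
]

print([span.get_text() for span in matches])
```

Inside a spider, response.body would take the place of the html string.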
Upvotes: 2
Reputation: 1036
In my terminal your code works, assuming my example matches your file:
Input:
import scrapy
example='<div class="ParamText"><span>STODOLINK</span></div>'
scrapy.Selector(text=example).xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]').extract()
Output:
['<span>STODOLINK</span>']
Can you clarify what might be different?
Upvotes: 2
Reputation: 89325
contains() expects a string as its first argument. When it is given a node-set, as in:
/span[contains(text(),"STODOLINK")]
XPath 1.0 converts the node-set to a string using only its first node. So if the span has multiple text nodes and "STODOLINK" isn't in the first text node child, contains() in the above expression won't match. Apply the contains() check to each text node individually instead:
//*[@class="ParamText"]/span[text()[contains(.,"STODOLINK")]]
Or, if "STODOLINK" isn't necessarily a direct child of the span (it can be nested within another element inside the span), simply use . instead of text():
//*[@class="ParamText"]/span[contains(.,"STODOLINK")]
Upvotes: 16