Reputation: 1111
This is part of a Scrapy parser function. In a loop I extract text and search it for a string. The code picks up seemingly random items: it parses, but when I check manually there are no matches in the text. That is very strange, because if there is no match it should not enter the `if` and append the item:
    for site in set(sites):
        if (re.findall(r'(Обам*)', " ".join(site.xpath('div/div/div').extract()), re.IGNORECASE)) != None:
            item['Target'] = unicode('Obama')
            item['Label'] = unicode(" ".join(site.xpath('h3/a').extract()))
            items.append(item)
When I use another approach:
    len(re.search(r'(Обам*)', " ".join(site.xpath('div/div/div').extract()), re.IGNORECASE)) !=0:
it doesn't work at all: no items are parsed, even though I am sure it should find matches. If I remove the `len` call it starts working, but again randomly.
By the way, if I use the simple string method `find()`, it works fine.
Edit: Here is an example input (it is hard to match input and output exactly, so this is illustrative text; what matters is that the text contains no "Обама", yet it was saved as a match with "Target" : "Obama"):
<div class=\"b-text NahodkiStore-snippet\">\r\n\r\nОни оскорбили <b>Царева</b> не как частного человека, а как выразителя идей Юго-Востока.</div> <div class=\"b-text\"><div>Они оскорбили <b>Клинтон</b>\r\n не как частного человека, а как выразителя идей Юго-Востока. Они ясно \r\nдали понять, какое будущее они готовят русским на Украине.</div>
Output from MongoDB:
{
"_id" : ObjectId("538fa13abb88b114143d750b"),
"comment_datesaved" : ISODate("2014-06-05T02:44:01.749Z"),
"comment_text" : "<div class=\"b-text NahodkiStore-snippet\">\r\n\r\nОни оскорбили <b>Царева</b> не как частного человека, а как выразителя идей Юго-Востока.</div> <div class=\"b-text\"><div>Они оскорбили <b>Обаму</b>\r\n не как частного человека, а как выразителя идей Юго-Востока. Они ясно \r\nдали понять, какое будущее они готовят русским на Украине.</div>",
"Target" : "Obama",
"Label" : "<a href=\"http://mikle1.livejournal.com/3907742.html?thread=42077854\" class=\"NahodkiStore-link SearchStatistics-link\" target=\"_blank\">\r\n\r\nОни оскорбили <b>Царева</b> не как частного человека, а как выразителя идей Юго-Востока.</a>",
}
Upvotes: 0
Views: 91
Reputation: 180461
In [38]: import re
In [39]: s = "a string"
In [40]: re.findall("hello", s)== []
Out[40]: True
In [41]: re.findall("hello", s)==None
Out[41]: False
In [42]: re.findall("hello", s) != None
Out[42]: True
In [43]: re.findall("hello", s)
Out[43]: []
`re.findall` returns an empty list, not `None`.
You should use:
`if (re.findall(r'(Обам*)', " ".join(site.xpath('div/div/div').extract()), re.IGNORECASE))`.
Drop the `!= None`, or you will always execute the code after the `if` statement, because an empty list is never equal to `None`.
In [49]: if re.findall("hello", s):
   ....:     print("found")
   ....:
In [48]: if not re.findall("hello", s):
   ....:     print("not found")
   ....:
not found
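The difference between the two tests can be sketched as a small standalone script. This is a Python 3 sketch rather than the original spider code: `has_match` is a hypothetical helper name, and the sample strings are shortened from the example text in the question.

```python
import re


def has_match(pattern, text):
    """Return True only if pattern occurs at least once in text."""
    # re.findall returns a list; an empty list is falsy, so bool() is enough.
    return bool(re.findall(pattern, text, re.IGNORECASE))


no_obama = "Они оскорбили <b>Клинтон</b> не как частного человека"
with_obama = "Они оскорбили <b>Обаму</b> не как частного человека"

# The comparison from the question is always True, because [] != None.
always_true = re.findall(r"Обам", no_obama, re.IGNORECASE) != None

print(always_true)                     # True, even though nothing matched
print(has_match(r"Обам", no_obama))    # False
print(has_match(r"Обам", with_obama))  # True
```

The sketch uses `Обам` rather than `Обам*`. In a regular expression `м*` means "zero or more `м`", so `Обам*` also matches a bare `Оба`, which may be broader than intended.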
The same applies to `re.search`, except that it does return `None` when it finds no match. You should still just use `if re.search(....` without any `==` or `!=`:
In [64]: re.search("hello", s)!=0
Out[64]: True
In [65]: re.search("hello", s)==0
Out[65]: False
In [66]: re.search("hello", s)==None
Out[66]: True
Using `len` on a search that returns `None` raises a `TypeError`: `object of type 'NoneType' has no len()`.
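A minimal Python 3 sketch of why `len` is the wrong tool here. The helper name `len_error` is made up for illustration. It shows that `len` fails for a missing match, and, as far as I know, for a successful one too, since match objects do not define a length.

```python
import re

s = "a string"

miss = re.search("hello", s)   # no match, so this is None
hit = re.search("string", s)   # a match object


def len_error(obj):
    """Return the TypeError message raised by len(obj), or None if len() works."""
    try:
        len(obj)
    except TypeError as exc:
        return str(exc)
    return None


print(len_error(miss))             # object of type 'NoneType' has no len()
print(len_error(hit) is not None)  # True: len() fails on a match object as well

# Truthiness (or an explicit "is not None") is the reliable test.
print(bool(miss), bool(hit))       # False True
```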
If your strings are unicode, you could specify a list of words and check whether each one occurs in your string:
    words = [u'Обам', u'Путин', u'OBAMA']
    for word in words:
        if re.search(word.encode("utf-8"), "".join(site.xpath('div/div/div').extract())):
            pass  # match found: populate and append the item here
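The word-list idea can be tried outside Scrapy with a short Python 3 sketch. It assumes Python 3, where `str` is already Unicode and no `.encode("utf-8")` step is needed. `find_words` is a hypothetical helper, and the sample text is shortened from the question.

```python
import re

words = ["Обам", "Путин", "OBAMA"]


def find_words(text, words):
    """Return the entries of words that occur in text, ignoring case."""
    found = []
    for word in words:
        # re.escape keeps any regex metacharacters in a word literal.
        if re.search(re.escape(word), text, re.IGNORECASE):
            found.append(word)
    return found


text = "Они оскорбили <b>Обаму</b> не как частного человека"
print(find_words(text, words))  # ['Обам']
```

Returning the list of matched words, instead of a bare boolean, makes it easy to check by hand which word triggered a stored item.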
Upvotes: 3