Vic Nicethemer

Reputation: 1111

Regex behaves strangely in Python 2.7

This is part of a Scrapy parser function. In a loop I extract text and search it for a string. The code picks up seemingly random items: it parses, but when I check manually there are no matches in the text. This is very strange, because if there is no match it should not enter the "if" block and append the item:

for site in set(sites):
    if (re.findall(r'(Обам*)', " ".join(site.xpath('div/div/div').extract()), re.IGNORECASE)) !=None:
        item['Target'] = unicode('Obama')
        item['Label'] = unicode(" ".join(site.xpath('h3/a').extract()))
        items.append(item)

When I try another approach:

len(re.search(r'(Обам*)', " ".join(site.xpath('div/div/div').extract()), re.IGNORECASE)) !=0:

it does not work at all: no items are parsed, although I am sure it should find some. If I remove the "len" call it starts working, but again randomly (!!!).

By the way, if I use the simple string method string.find(), it works fine.

Edit: Here is an example input (it is hard to match input and output exactly, so this is only illustrative text; what matters is that the text does not contain "Обама", yet it was saved as a match for Obama, "Target" : "Obama"):

<div class=\"b-text NahodkiStore-snippet\">\r\n\r\nОни оскорбили <b>Царева</b> не как частного человека, а как выразителя идей Юго-Востока.</div> <div class=\"b-text\"><div>Они оскорбили <b>Клинтон</b>\r\n не как частного человека, а как выразителя идей Юго-Востока. Они ясно \r\nдали понять, какое будущее они готовят русским на Украине.</div>

Output from MongoDB:

{
    "_id" : ObjectId("538fa13abb88b114143d750b"),
    "comment_datesaved" : ISODate("2014-06-05T02:44:01.749Z"),
    "comment_text" : "<div class=\"b-text NahodkiStore-snippet\">\r\n\r\nОни оскорбили <b>Царева</b> не как частного человека, а как выразителя идей Юго-Востока.</div> <div class=\"b-text\"><div>Они оскорбили <b>Обаму</b>\r\n не как частного человека, а как выразителя идей Юго-Востока. Они ясно \r\nдали понять, какое будущее они готовят русским на Украине.</div>",
    "Target" : "Obama",
    "Label" : "<a href=\"http://mikle1.livejournal.com/3907742.html?thread=42077854\" class=\"NahodkiStore-link SearchStatistics-link\" target=\"_blank\">\r\n\r\nОни оскорбили <b>Царева</b> не как частного человека, а как выразителя идей Юго-Востока.</a>",

}

Upvotes: 0

Views: 91

Answers (1)

Padraic Cunningham

Reputation: 180461

In [38]: import re

In [39]: s = "a string"

In [40]: re.findall("hello", s)== []
Out[40]: True

In [41]: re.findall("hello", s)==None
Out[41]: False

In [42]: re.findall("hello", s) != None
Out[42]: True

In [43]: re.findall("hello", s) 
Out[43]: []

re.findall returns an empty list, not None, when nothing matches.

You should use:

if re.findall(r'(Обам*)', " ".join(site.xpath('div/div/div').extract()), re.IGNORECASE):

Drop the != None, or the code after the if statement will always execute, because a list (even an empty one) is never equal to None.

In [49]: if re.findall("hello", s):
   ....:     print("found")
   ....:

In [48]: if not re.findall("hello", s):
   ....:     print("not found")
   ....:
not found
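
The same effect can be reproduced outside Scrapy. This is only a minimal sketch: `texts` is a hypothetical stand-in for the strings that site.xpath(...).extract() would produce, and the pattern is simplified to the stem "Обам". It compares the condition from the question with the truthiness check suggested above.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical stand-ins for the text extracted from each site node.
texts = [u"Они оскорбили Клинтон", u"Они оскорбили Обаму"]

# A unicode pattern with re.UNICODE so that IGNORECASE covers Cyrillic too.
pattern = re.compile(u"Обам", re.IGNORECASE | re.UNICODE)

# Condition from the question: findall returns a list, which is never None,
# so every text passes the test.
buggy_hits = [t for t in texts if pattern.findall(t) != None]

# Suggested condition: an empty list is falsy, a non-empty list is truthy.
fixed_hits = [t for t in texts if pattern.findall(t)]
```

With these inputs the first condition accepts both texts, while the second keeps only the one that actually mentions the name.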

The same applies to re.search, except that it does return None when it finds no match. You should still just write if re.search(...) without any == or != comparison:

In [64]: re.search("hello", s)!=0
Out[64]: True

In [65]: re.search("hello", s)==0
Out[65]: False

In [66]: re.search("hello", s)==None
Out[66]: True

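Calling len on a search that returns None raises TypeError: object of type 'NoneType' has no len(). A successful search returns a match object, which has no len() either, so wrapping re.search in len fails in both cases.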
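
If all you need is a yes/no answer, a small wrapper avoids both the len call and the comparisons. This is a sketch, not part of the original code: the helper names `has_match` and `len_fails` are hypothetical and exist only to illustrate the behaviour.

```python
import re

def has_match(pattern, text, flags=0):
    """Return True if pattern occurs anywhere in text."""
    # re.search returns a match object or None; neither supports len().
    return re.search(pattern, text, flags) is not None

def len_fails(value):
    """Return True if len(value) raises TypeError."""
    try:
        len(value)
    except TypeError:
        return True
    return False

s = "a string"
found = has_match("string", s)
missing = has_match("hello", s)
```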

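If your strings are unicode, you could specify a list of words and check whether each one occurs in your string:

words = [u'Обам', u'Путин', u'OBAMA']
text = "".join(site.xpath('div/div/div').extract())
for word in words:
    # the extracted text is unicode, so keep the pattern unicode as well
    if re.search(word, text, re.UNICODE):
        pass  # handle the match here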
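
A slightly fuller sketch of the word-list idea follows, under stated assumptions: the `TARGETS` mapping and the `find_targets` helper are hypothetical, the stems are assumed to contain no regex metacharacters, and the text is assumed to be unicode. The last lines also show that the pattern from the question, `Обам*`, means "Оба" followed by zero or more "м", so it matches the unrelated word "оба" as well.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical mapping from a search stem to the value stored as the target.
TARGETS = [(u"Обам", u"Obama"), (u"Путин", u"Putin"), (u"Obama", u"Obama")]

FLAGS = re.IGNORECASE | re.UNICODE

def find_targets(text):
    """Return the targets whose stem occurs in text, ignoring case."""
    found = []
    for stem, target in TARGETS:
        # A truthiness check is enough: re.search returns None on no match.
        if re.search(stem, text, FLAGS) and target not in found:
            found.append(target)
    return found

# The quantifier applies only to the last letter, so "оба" ("both") matches.
loose = re.findall(u"Обам*", u"оба варианта", FLAGS)
```

For example, `find_targets(u"Они оскорбили Обаму")` yields a list containing only `u"Obama"`, while a text that mentions none of the stems yields an empty list, which can be tested directly in an if statement.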

Upvotes: 3
