NLTK created string regex not working

Question

I'm trying to match a regex against a string I got from NLTK. I have a stock class with a method that fetches 10-K filings from EDGAR and stores each one as a string, using NLTK to strip the HTML:

def get_raw_10ks(self):
    for file in self.files_10k:
        data = self.__get_data_from_url(file)
        raw = nltk.clean_html(data)
        self.raw_10ks.append(raw)

Then, in my program itself, I have

stock.get_raw_10ks()
matchObj = re.match("Indicates", stock.raw_10ks[0])
print matchObj.group()

I get the error

print matchObj.group()
AttributeError: 'NoneType' object has no attribute 'group'

However, when I check the type of stock.raw_10ks[0], it is a string, and when I print it out, one of the last lines is "Indicates management compensatory plan", so I'm not sure what's wrong. I checked that re and nltk are imported correctly.

falsetru · Accepted Answer

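re.match() only matches the pattern at the beginning of the input string, so it returns None when "Indicates" appears anywhere later in the text. Use re.search() instead, which scans the whole string for the first match.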

# match()
>>> re.match('Indicates', 'Indicates management compensatory')
<_sre.SRE_Match object at 0x0000000002CC8100>
>>> re.match('Indicates', 'This Indicates management compensatory')  # returns None, so nothing is printed

# search()
>>> re.search('Indicates', 'This Indicates management compensatory')
<_sre.SRE_Match object at 0x0000000002CC8168>

See the "search() vs. match()" section of the re module documentation.
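Since the string in the question spans many lines and "Indicates" starts one of the last ones, the multi-line case may be worth a quick sketch. The sample text below is made up for illustration and only stands in for stock.raw_10ks[0]. It shows that re.MULTILINE does not change where match() looks, while search() with a ^ anchor can match at the start of any line.

```python
import re

# Hypothetical stand-in for stock.raw_10ks[0]: several lines, with the
# word of interest at the start of the last one.
text = "Exhibit list\nSome other line\nIndicates management compensatory plan"

# match() only tries position 0 of the whole string, even with MULTILINE.
at_start = re.match(r"Indicates", text, re.MULTILINE)

# search() scans the string; with MULTILINE, ^ anchors at each line start.
line_start = re.search(r"^Indicates", text, re.MULTILINE)

print(at_start)             # None
print(line_start.group())   # Indicates
```

The print(...) calls with a single argument behave the same under Python 2 and Python 3.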

To make the program robust, check the return value of the call:

matchObj = re.search("Indicates", stock.raw_10ks[0])
if matchObj is not None: # OR  if matchObj:
    print matchObj.group()
else:
    print 'No match found.'
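If the same check is needed in several places, it could be wrapped in a small helper. This is only a sketch: first_match and its default parameter are hypothetical names, not part of re or NLTK.

```python
import re

def first_match(pattern, text, default=None):
    """Return the text of the first match of pattern in text, or default."""
    m = re.search(pattern, text)
    # A match object is always truthy; None means nothing matched.
    return m.group() if m else default

print(first_match("Indicates", "This Indicates management compensatory"))
print(first_match("Indicates", "This management compensatory", "No match found."))
```

The caller then never touches .group() on a None result, which is what raised the AttributeError in the question.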

By the way, if you only need to check whether Indicates occurs in the string, the in operator is preferable:

>>> 'Indicates' in 'This Indicates management compensatory'
True
>>> 'Indicates' in 'This management compensatory'
False
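Plain string methods can also report where the word occurs or pull out the lines that contain it, again without re. A minimal sketch, using a made-up sample string in place of the real filing text:

```python
# Hypothetical sample text standing in for one downloaded filing.
text = "Exhibit list\nIndicates management compensatory plan\nEnd"

# str.find returns the index of the first occurrence, or -1 if absent.
pos = text.find("Indicates")

# Collect every line that contains the word.
hits = [line for line in text.splitlines() if "Indicates" in line]

print(pos)   # 13
print(hits)  # ['Indicates management compensatory plan']
```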
