Paul Gowder

Reputation: 2539

python 3 regex not finding confirmed matches

So I'm trying to parse a bunch of citations from a text file using the re module in Python 3.4 (on, if it matters, a Mac running Mavericks). Here's some minimal code. Note that there are two commented-out lines: they represent two alternative search patterns. (Obviously, the little one, r'Rawls', is the one that works.)

def makeRefList(reffile):
    print(reffile)
    # namepattern = r'(^[A-Z1][A-Za-z1]*-?[A-Za-z1]*),.*( \(?\d\d\d\d[a-z]?[.)])'
    # namepattern = r'Rawls'
    refsTuplesList = re.findall(namepattern, reffile, re.MULTILINE)
    print(refsTuplesList)

The string in question is ugly, and so I stuck it in a gist: https://gist.github.com/paultopia/6c48c398a42d4834f2ae

As noted, the search string r'Rawls' produces expected output ['Rawls', 'Rawls']. However, the other search string just produces an empty list.

I've confirmed this regex (partially) works using the regex101 tester. Confirmation here: https://regex101.com/r/kP4nO0/1 -- this matches what I expect it to match. Since it works in the tester, it should work in the code, right?

(n.b. I copied the text from terminal output from the first print command, then manually replaced \n characters in the string with carriage returns for regex101.)

One possible issue is that python has appended the bytecode flag (is the little b called a "flag?") to the string. This is an artifact of my attempt to convert the text from utf-8 to ascii, and I haven't figured out how to make it go away.

Yet re clearly is able to parse strings in that form. I know this because I'm converting two text files from utf-8 to ascii, and the following code works perfectly fine on the other string, converted from the other text file, which also has a little b in front of it:

def makeCiteList(citefile):
    print(citefile)
    citepattern = r'[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]? \(?\d\d\d\d[a-z]?[\s.,)]'
    rawCitelist = re.findall(citepattern, citefile)
    cleanCitelist = cleanup(rawCitelist)
    finalCiteList = list(set(cleanCitelist))
    print(finalCiteList)
    return(finalCiteList)

The other chunk of text, which the code immediately above matches correctly: https://gist.github.com/paultopia/a12eba2752638389b2ee

The only hypothesis I can come up with is that the first, broken, regex expression is puking on the combination of newline characters and the string being treated as a byte object, even though a) I know the regex is correct for newlines (because, confirmation from the linked regex101), and b) I know it's matching the strings (because, confirmation from the successful match on the other string).

If that's true, though, I don't know what to do about it.

Thus, questions:

1) Is my hypothesis right that it's the combination of newlines and b that blows up my regex? If not, what is?
2) How do I fix that?
   a) replace the newlines with something in the string?
   b) rewrite the regex somehow?
   c) somehow get rid of that b and make it into a normal string again? (how?)

thanks!

Addition

In case this is a problem I need to fix upstream, here's the code I'm using to get the text files and convert to ascii, replacing non-ascii characters:

This function gets called on UTF-8 .txt files saved by TextWrangler on Mavericks:

def makeCorpoi(citefile, reffile):
    citebox = open(citefile, 'r')
    refbox = open(reffile, 'r')
    citecorpus = citebox.read()
    refcorpus = refbox.read()
    citebox.close()
    refbox.close()
    corpoi = [str(citecorpus), str(refcorpus)]
    return corpoi

And then this function gets called on each element of the list the above function returns:

def conv2ASCII(bigstring): 
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    stringstring = str(bigstring)
    return stringstring

Upvotes: 0

Views: 79

Answers (1)

Paul Gowder

Reputation: 2539

Aah. I've tracked it down and answered my own question. Apparently one needs to call decode() on the bytes object that encode() returns. The following code produces an actual string, with newlines and everything, out the other end (though now I have to fix a bunch of other bugs before I can figure out if the final output is as expected):

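import codecs

def conv2ASCII(bigstring):
    def convHandler(error):
        # Substitute a marker for each non-ASCII character, then resume after it
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    # encode() returns a bytes object; decode() turns it back into a real str
    bigstring = bigstring.encode('ascii', 'foreign')
    newstring = bigstring.decode('ascii', 'foreign')
    return newstring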

Apparently the str() function doesn't do the same job, for reasons that are mysterious to me. This is despite an answer to "How to make new line commands work in a .txt file opened from the internet?", which suggests that it does.
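
For what it's worth, the difference seems to come down to this: calling str() on a bytes object without an encoding argument returns the printable representation of the bytes (the b'...' prefix, plus a literal backslash followed by n wherever a newline was), not the decoded text. The result is one long line, so ^ under re.MULTILINE can only match at the very start, where it runs into the b. Here is a minimal sketch using the name pattern from the question; the two-line sample text is made up for illustration and is not taken from the gist.

```python
import re

# Hypothetical sample standing in for the reference list (not the gist text)
raw = ("Rawls, John. 1971. A Theory of Justice.\n"
       "Sen, Amartya. 2009. The Idea of Justice.\n")
encoded = raw.encode("ascii")        # a bytes object

as_repr = str(encoded)               # repr text: b'...' with backslash-n spelled out
as_text = encoded.decode("ascii")    # a real str with real newlines

# The name pattern from the question, unchanged
namepattern = r'(^[A-Z1][A-Za-z1]*-?[A-Za-z1]*),.*( \(?\d\d\d\d[a-z]?[.)])'

print(as_repr[:10])                                        # begins with b'
print("\n" in as_repr)                                     # no real newlines left
print(re.findall(namepattern, as_repr, re.MULTILINE))      # nothing to anchor on
print(re.findall(namepattern, as_text, re.MULTILINE))      # one tuple per line
```

Passing the encoding explicitly, as in str(encoded, 'ascii'), decodes the bytes the same way encoded.decode('ascii') does, so that form would also have worked.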

Upvotes: 1
