python 2.7 weird unicode re.sub behavior

Question

I'm writing a program to normalize CSV header columns. A column name may contain a non-English character, so I added the re.UNICODE flag to my re.sub calls.

My code looks like this:

for i in range(0, len(row)):
    column = row[i]
    column = column.lower()
    column = re.sub('[\W]', '_', column, flags=re.IGNORECASE | re.UNICODE)
    column = re.sub('[_]{2,}', '_', column, flags=re.UNICODE)
    column = column.strip('_')
    print column

In my current scenario I have one column with a non-English character: Printer geïntegreerd. The encoding of the originating file is UTF-8. I'm not writing the result to a file yet, just writing to console.

The column gets converted to: printer_ge�_ntegreerd.

When I leave out the re.UNICODE flag, it gets converted to: printer_ge_ntegreerd.

What am I doing wrong here?

Elad · Accepted Answer

I decoded each column as UTF-8 before calling re.sub and got the wanted result:

import re

for column in row:
    # Decode the UTF-8 bytes to unicode so re sees whole characters
    column = column.decode('utf-8').lower()
    column = re.sub(r'\W', '_', column, flags=re.UNICODE)
    column = re.sub(r'_{2,}', '_', column)
    column = column.strip('_')
    print column

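When you read input from a file in Python 2, it is of type str, which is just a sequence of bytes; "re" does not know that those bytes are UTF-8. In UTF-8, ï is stored as the two bytes 0xC3 0xAF. With re.UNICODE, each byte is classified on its own as if it were a code point: 0xC3 counts as a letter (Ã) and is kept, while 0xAF (¯) is not a word character and becomes "_". That leaves a lone 0xC3 byte, which is not valid UTF-8, so the console shows �. Without the flag, both bytes are non-ASCII, both get replaced, and the second re.sub collapses them into a single "_". Calling decode('utf-8') turns the str into type unicode, and "re" then recognises ï as one word character. Note that after the decoding, column is of type unicode, so you will have to encode it again when you write it out as bytes.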
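
To make the difference concrete, here is a minimal sketch that contrasts the two paths. It is written so it also runs on Python 3, where re refuses re.UNICODE on bytes, so the byte-by-byte matching is only emulated by decoding as latin-1 (which maps every byte to the code point with the same value). The names RAW, collapse, normalize_bytewise and normalize_decoded are made up for this illustration.

```python
import re

# UTF-8 bytes of "Printer geïntegreerd"; \xc3\xaf is the encoded ï (U+00EF)
RAW = b'Printer ge\xc3\xafntegreerd'


def collapse(text):
    """Replace non-word characters with '_' and squeeze repeated underscores."""
    text = re.sub(r'\W', '_', text, flags=re.UNICODE)
    return re.sub(r'_{2,}', '_', text).strip('_')


def normalize_bytewise(raw):
    """Emulation of matching on undecoded bytes (an assumption of this sketch):
    latin-1 gives one code point per byte, so each byte is judged on its own."""
    return collapse(raw.lower().decode('latin-1')).encode('latin-1')


def normalize_decoded(raw, encoding='utf-8'):
    """Decode first, so the two bytes are seen as the single letter U+00EF."""
    return collapse(raw.decode(encoding).lower())


# The byte-wise result keeps 0xC3 and replaces 0xAF, leaving invalid UTF-8
print(repr(normalize_bytewise(RAW)))
# The decoded result keeps the ï intact
print(repr(normalize_decoded(RAW)))
```

The first call leaves a stray 0xC3 byte in front of the underscore, which is what the console renders as �; the second keeps the whole character.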

Hope this helps.
