Reputation: 8183
I'm writing a program to normalize CSV header columns. A column may contain non-English characters, so I added the re.UNICODE flag to my re.sub calls.
My code looks like this:
for i in range(0, len(row)):
    column = row[i]
    column = column.lower()
    column = re.sub('[\W]', '_', column, flags=re.IGNORECASE | re.UNICODE)
    column = re.sub('[_]{2,}', '_', column, flags=re.UNICODE)
    column = column.strip('_')
    print column
In my current scenario I have one column with a non-English character: Printer geïntegreerd. The encoding of the originating file is UTF-8. I'm not writing the result to a file yet, just printing to the console.
The column gets converted to: printer_ge�_ntegreerd.
When I leave out the re.UNICODE flag, it gets converted to: printer_ge_ntegreerd.
What am I doing wrong here?
Upvotes: 0
Views: 45
Reputation: 221
I tried decoding the input as "utf-8" and got the wanted result:
for i in range(0, len(row)):
    column = row[i].decode('utf-8')
    column = column.lower()
    column = re.sub('[\W]', '_', column, flags=re.IGNORECASE | re.UNICODE)
    column = re.sub('[_]{2,}', '_', column, flags=re.UNICODE)
    column = column.strip('_')
    print column
When you read input from a file, you get a value of type str (raw bytes) in whatever encoding the file was written in. If you manipulate it with re without decoding it first, the bytes are not interpreted with the file's encoding, so multi-byte characters fall apart. The most common encodings are "utf-8" and "latin-1"; decoding with the right one turns the value into type unicode, and re then recognises the characters correctly. Note that after the decode, column is of type unicode.
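Here is a minimal Python 2 sketch of that difference, using the sample value from the question (the hard-coded UTF-8 bytes and the helper name are just for illustration): the same normalization run on the raw byte str splits the two UTF-8 bytes of "ï", while running it on the decoded unicode string keeps the character intact.

    # -*- coding: utf-8 -*-
    import re

    # UTF-8 bytes as they would come out of the file ("Printer geïntegreerd")
    raw = 'Printer ge\xc3\xafntegreerd'

    def normalize(column):
        # same steps as in the question
        column = column.lower()
        column = re.sub('[\W]', '_', column, flags=re.IGNORECASE | re.UNICODE)
        column = re.sub('[_]{2,}', '_', column, flags=re.UNICODE)
        return column.strip('_')

    # byte str: the two bytes of "ï" are handled one by one, so one of them
    # becomes '_' and the leftover byte typically shows up as � on a UTF-8 console
    print normalize(raw)

    # unicode: "ï" is a single word character, so it survives the substitution
    print normalize(raw.decode('utf-8'))   # printer_geïntegreerd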
Hope this helps.
Upvotes: 1