Python 3.x replacements with Unicode characters

Question

I want to replace line feeds by spaces when the next line begins with a lowercase character.

My code for Python 3.3 is working when the next line begins with [a-z] but fails if it begins with a (lowercase) Unicode character.

Test file (saved as UTF-8):

Test to check
whether it's working.
Aquela é
árvore pequena.

import os
import re

input_file = 'test.txt'
input_file_path = os.path.join("c:", "\\", "Users", "Paulo", "workspace", "pdf_to_text", input_file)
input_string = open(input_file_path).read()

print(input_string)

pattern = r'\n([a-zàáâãäåæçčèéêëěìíîïłðñńòóôõöøőřśŝšùúûüůýÿżžÞ]+)'
pattern_obj = re.compile(pattern)
replacement_string = " \\1"
output_string = pattern_obj.sub(replacement_string, input_string)
print(output_string)

jfs · Accepted Answer

... The unicode characters é and á in the original file are changed to Ã© and Ã¡ respectively when I read() the file.

Your actual issue is unrelated to regexes. You are reading UTF-8 text using the Latin-1 encoding, which is incorrect.

>>> print("é".encode('utf-8').decode('latin-1'))
Ã©
>>> print("á".encode('utf-8').decode('latin-1'))
Ã¡

To read a UTF-8 file:

with open(filename, encoding='utf-8') as file:
    text = file.read()
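
To see the difference end to end, here is a minimal sketch. It assumes a throwaway file created with tempfile and an illustrative sample string; neither is the OP's actual file.

```python
import os
import tempfile

# Hypothetical sample text; the OP's real file is not reproduced here.
sample = "Aquela \xe9\n\xe1rvore pequena.\n"

# Save the sample as UTF-8 in a temporary file.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Wrong: decoding UTF-8 bytes as Latin-1 turns each two-byte
    # character into two unrelated characters.
    with open(path, encoding="latin-1") as file:
        wrong = file.read()

    # Right: decode with the encoding the file was saved in.
    with open(path, encoding="utf-8") as file:
        right = file.read()
finally:
    os.remove(path)

print(wrong == sample)  # False
print(right == sample)  # True
```

Passing encoding explicitly avoids depending on the platform's default encoding, which open() falls back to when the argument is omitted.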

Old answer about regexes (unrelated to the OP's issue):

In general, a single user-perceived character such as ç or é may span multiple Unicode code points, so [çé] may match those code points separately instead of matching the whole character. (?:ç|é) would fix this one issue, but there are others, e.g., Unicode normalization (NFC, NFKD).
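
A small sketch of the code point issue, assuming the text contains é in its decomposed form (e followed by a combining acute accent). The variable names are illustrative only.

```python
import re
import unicodedata

composed = "\xe9"        # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# Both render as é, but they are different strings.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# A character class holding the composed form does not match
# the two-code-point form as a whole.
print(re.fullmatch("[\xe9]", decomposed))  # None

# Normalizing to NFC composes the pair into one code point first.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)                          # True
print(re.fullmatch("[\xe9]", normalized) is not None)  # True
```

Normalizing both the pattern and the input to the same form before matching is one way to sidestep this; whether NFC or another form is appropriate depends on the data.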

I want to replace line feeds by spaces when the next line begins with a lowercase character.

The regex module supports the POSIX character class [:lower:]:

import regex  # $ pip install regex

text = ("Test to check\n"
        "whether it's working.\n"
        "Aquela \xe9\n"
        "\xe1rvore pequena.\n")
print(text)
# -> Test to check
# -> whether it's working.
# -> Aquela é
# -> árvore pequena.
print(regex.sub(r'\n(?=[[:lower:]])', ' ', text))
# -> Test to check whether it's working.
# -> Aquela é árvore pequena.

To emulate the [:lower:] class using the re module:

import re
import sys
import unicodedata

# \p{Ll} chars
lower_chars = [u for u in map(chr, range(sys.maxunicode))
               if unicodedata.category(u) == 'Ll']
lower = "|".join(map(re.escape, lower_chars))
print(re.sub(r"\n(?={lower})".format(lower=lower), ' ', text))

The result is the same.
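
A lighter-weight alternative, not part of the approach above, is to let a replacement function decide with str.islower() instead of building a large alternation. This is only a sketch: join_if_lower is a hypothetical helper name, and str.islower() may not agree with the Ll category for every character.

```python
import re

text = ("Test to check\n"
        "whether it's working.\n"
        "Aquela \xe9\n"
        "\xe1rvore pequena.\n")

def join_if_lower(match):
    """Replace the newline with a space only if the next character is lowercase."""
    next_char = match.group(1)
    if next_char.islower():
        return " " + next_char
    # Otherwise keep the newline and the following character unchanged.
    return match.group(0)

# "." does not match a newline by default, so a trailing newline is left alone.
result = re.sub(r"\n(.)", join_if_lower, text)
print(result)
```

On the sample text this joins the same two line breaks as the examples above, while avoiding a scan of the whole Unicode range at startup.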
