cyril

Reputation: 3006

Python: find a series of Chinese characters within a string and apply a function

I have a collection of text that is mostly English but contains some phrases in Chinese characters. Here are two examples:

s1 = "You say: 你好. I say: 再見"
s2 = "答案, my friend, 在風在吹"

I'm trying to find each block of Chinese, apply a function that translates it (I already have a way to do the translation), and then substitute the translation back into the string in place of the Chinese. So the output would be something like this:

o1 = "You say: hello. I say: goodbye"
o2 = "The answer, my friend, is blowing in the wind"

I can find the Chinese characters easily by doing this:

utf_line = s1.decode('utf-8') 
re.findall(ur'[\u4e00-\u9fff]+',utf_line)

...But I end up with a list of all the Chinese characters and no way of determining where each phrase begins and ends.

Upvotes: 4

Views: 4660

Answers (4)

Mariano

Reputation: 6511

You can't get the indexes using re.findall(). You could use re.finditer() instead, and refer to m.group(), m.start() and m.end().
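As a rough illustration of the finditer() route, here is a minimal Python 3 sketch that records where each block begins and ends. The names CHINESE_RE and find_blocks are made up for this example, and the character range is simply the one from the question.

```python
import re

# Same character range as in the question (CJK Unified Ideographs)
CHINESE_RE = re.compile(r'[\u4e00-\u9fff]+')

def find_blocks(text):
    """Return a (start, end, block) tuple for each run of Chinese characters."""
    return [(m.start(), m.end(), m.group()) for m in CHINESE_RE.finditer(text)]

for start, end, block in find_blocks("You say: 你好. I say: 再見"):
    print(start, end, block)
```

Each tuple gives the slice text[start:end] that holds one phrase, so the phrase boundaries are no longer lost.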

However, for your particular case, it seems more practical to call a function using re.sub().

From the re.sub() documentation: "If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string."

Code:

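# -*- coding: utf-8 -*-
import re

s = "You say: 你好. I say: 再見. 答案, my friend, 在風在吹"
utf_line = s.decode('utf-8')

# Example lookup table standing in for a real translator
# (named so that it does not shadow the built-in dict)
translations = {"你好" : "hello",
                "再見" : "goodbye",
                "答案" : "The answer",
                "在風在吹" : "is blowing in the wind",
               }

def translate(m):
    # m.group() is a unicode string; the table keys are UTF-8 byte strings
    block = m.group().encode('utf-8')
    # Do your translation here

    # this is just an example
    if block in translations:
        return translations[ block ]
    else:
        return "{unknown}"


# flags must be passed by keyword: the fourth positional argument is count
utf_translated = re.sub(ur'[\u4e00-\u9fff]+', translate, utf_line, flags=re.UNICODE)

print utf_translated.encode('utf-8')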

Output:

You say: hello. I say: goodbye. The answer, my friend, is blowing in the wind
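For readers on Python 3, where str is already Unicode and no decode/encode step is needed, the same re.sub() idea might look roughly like this. PHRASES, translate_block and translate_text are hypothetical names, and the lookup table is only a stand-in for whatever translation function you already have.

```python
import re

CHINESE_RE = re.compile(r'[\u4e00-\u9fff]+')

# Hypothetical stand-in for a real translation backend
PHRASES = {
    "你好": "hello",
    "再見": "goodbye",
    "答案": "The answer",
    "在風在吹": "is blowing in the wind",
}

def translate_block(match):
    """Called by re.sub() once per block; receives a match object."""
    return PHRASES.get(match.group(), "{unknown}")

def translate_text(text):
    # sub() calls translate_block for every non-overlapping match
    return CHINESE_RE.sub(translate_block, text)

print(translate_text("You say: 你好. I say: 再見"))
print(translate_text("答案, my friend, 在風在吹"))
```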

Upvotes: 3

Ashan

Reputation: 317

You can replace each regular-expression match in place by using re.sub() in Python.

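Pass a function as the replacement so that translate() receives each matched block. (Writing translate('\g<0>') as the argument would call translate() once, on the literal string '\g<0>', before re.sub() even runs.) Assuming translate() takes a string, try this:

print(re.sub(r'[\u4e00-\u9fff]+', lambda m: translate(m.group(0)), utf_line))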

Upvotes: 7

tdelaney

Reputation: 77367

Regular expression Match objects give you the start and end indexes of a match. So, instead of findall, do your own search and record the indexes as you go. Then, you can translate each extent and replace in the string based on the known indexes of the phrases.

import re

_scan_chinese_re = re.compile(r'[\u4e00-\u9fff]+')

s1 = "You say: 你好. I say: 再見"
s2 = "答案, my friend, 在風在吹"

def translator(chinese_text):
    """My no good translator"""
    return ' '.join('??' for _ in chinese_text)

def scanner(text):
    """Scan text string, translate chinese and return copy"""
    print('----> text:', text)

    # list of extents where chinese text is found
    chinese_inserts = [] # [start, end]

    # keep scanning text to end
    index = 0
    while index < len(text):
        m = _scan_chinese_re.search(text[index:])
        if not m:
            break
        # get extent from match object and add to list
        start = index + m.start()
        end = index + m.end()
        print('chinese at index', start, text[start:end])
        chinese_inserts.append([start, end])
        # end is already an absolute index, so assign rather than add
        index = end

    # copy string and replace backwards so we don't mess up indexes
    copy = list(text)
    while chinese_inserts:
        start, end = chinese_inserts.pop()
        copy[start:end] = translator(text[start:end])
    text = ''.join(copy)
    print('final', text)
    return text

scanner(s1)
scanner(s2)

With my questionable translator, the result is

----> text: You say: 你好. I say: 再見
chinese at index 9 你好
chinese at index 20 再見
final You say: ?? ??. I say: ?? ??
----> text: 答案, my friend, 在風在吹
chinese at index 0 答案
chinese at index 15 在風在吹
final ?? ??, my friend, ?? ?? ?? ??
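The same index-based idea can also be sketched more compactly with finditer(), building the result from the slices between matches so that nothing has to be replaced backwards. This is only one possible variant, not the answer's original code; scan_and_replace and placeholder_translator are illustrative names.

```python
import re

CHINESE_RE = re.compile(r'[\u4e00-\u9fff]+')

def placeholder_translator(chinese_text):
    """Stand-in translator: one '??' per Chinese character."""
    return ' '.join('??' for _ in chinese_text)

def scan_and_replace(text, translator=placeholder_translator):
    """Copy text, swapping each Chinese block for its translation."""
    pieces = []
    last_end = 0
    for m in CHINESE_RE.finditer(text):
        pieces.append(text[last_end:m.start()])  # untouched text before the block
        pieces.append(translator(m.group()))     # translated block
        last_end = m.end()
    pieces.append(text[last_end:])               # trailing text after the last block
    return ''.join(pieces)

print(scan_and_replace("You say: 你好. I say: 再見"))
```

Because the pieces are appended left to right, the indexes of later matches never shift.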

Upvotes: 0

Francisco

Reputation: 11496

A possible solution is to capture everything, but in different capture groups, so you can differentiate later if they're in Chinese or not.

ret = re.findall(ur'([\u4e00-\u9fff]+)|([^\u4e00-\u9fff]+)', utf_line)
result = []
for match in ret:
    if match[0]:
        result.append(translate(match[0]))
    else:
        result.append(match[1])

print(''.join(result))
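A self-contained Python 3 sketch of the same two-group approach is below. It assumes translate is any function that takes a string and returns a string; SEGMENT_RE and translate_segments are names invented for the example, and the lambda in the usage line is just a visible placeholder.

```python
import re

# Group 1 captures a Chinese run, group 2 captures everything else
SEGMENT_RE = re.compile(r'([\u4e00-\u9fff]+)|([^\u4e00-\u9fff]+)')

def translate_segments(text, translate):
    """Translate Chinese segments and keep all other segments unchanged."""
    result = []
    # findall() yields (chinese, other) tuples; exactly one of the two is non-empty
    for chinese, other in SEGMENT_RE.findall(text):
        result.append(translate(chinese) if chinese else other)
    return ''.join(result)

print(translate_segments("答案, my friend, 在風在吹", lambda block: "<" + block + ">"))
```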

Upvotes: 2
