Reputation: 1504

Work with Chinese in Python

I`m trying to work with Chinese text and big data in Python. Part of work is clean text from some unneeded data. For this goal I am using regexes. However I met some problems as in Python regex as in PyCharm application:

1) The data is stored in postgresql and viewed well in the columns, however, after select and pull it to the var it is displayed as a square:

enter image description here When the value printed to the console is looks like:

Mentholatum 曼秀雷敦男士深层活炭洁面乳100g（新包装）

So I presume there is no problem with application encoding but with debug part of encoding, however, I did not find any solutions for such behaviour.

2) The example of regex that I need to care is to remove the values between Chinese brackets include them. The code I used is:

#!/usr/bin/env python
# -*- coding: utf-8 -*

import re
from pprint import pprint 
import sys, locale, os

    columnString = row[columnName]
    startFrom = valuestoremove["startsTo"]
    endWith = valuestoremove["endsAt"]
    isInclude = valuestoremove["include"]
    escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
    nonASCIIregex = re.compile('([^\x00-\x7F])')
    if escapeCharsRegex.match(startFrom):
        startFrom = re.escape(startFrom)
    if escapeCharsRegex.match(endWith):
        endWith = re.escape(endWith)

    if isInclude:
        regex = startFrom + '(.*)' + endWith
    else:
        regex = '(?<=' + startFrom + ').*?(?=' + endWith + ')'
    if nonASCIIregex.match(regex):
        p = re.compile(ur'' + regex)
    else:
        p = re.compile(regex)
    row[columnName] = p.sub("", columnString).strip()

But the regex does not influence on the given string. I`ve made a test with next code:

#!/usr/bin/env python
# -*- coding: utf-8 -*
import re

reg = re.compile(ur'（(.*)）')
string = u"巴黎欧莱雅 男士 劲能冰爽洁面啫哩（原男士劲能净爽洁面啫哩）100ml"
print string
string = reg.sub("", string)
print string

And it is work fine for me. The only difference between those two code examples is that n the first the regex values are come from the txt file with json, encoded as utf-8:

{
                "between": {
                    "startsTo": "(",
                    "endsAt": "）",
                    "include": true,
                    "sequenceID": "1"
                }
            }, {
                "between": {
                    "startsTo": "（",
                    "endsAt": ")",
                    "include": true,
                    "sequenceID": "2"
                }
            },{
                "between": {
                    "startsTo": "(",
                    "endsAt": ")",
                    "include": true,
                    "sequenceID": "2"
                }
            },{
                "between": {
                    "startsTo": "（",
                    "endsAt": "）",
                    "include": true,
                    "sequenceID": "2"
                }
            }

The Chinese brackets from the file are also viewed like the squares:

enter image description here

I cant find explanation or any solution for such behavior, thus the community help need

Thanks for help.

Upvotes: 1

Work with Chinese in Python

Answers (2)

Related Questions