falconspy

Reputation: 790

Use search with regex to find Korean characters using Python

Using Python 2.7.9 on Windows 8.1 Enterprise 64-bit

I'm using the following code to search for any Korean characters ( http://lcweb2.loc.gov/diglib/codetables/9.3.html )

line = ['x', 'y', 'z', '쭌', 'a']

if any([re.search("[%s-%s]" % ("\xE3\x84\xB1".decode('utf-8'), "\xEC\xAD\x8C".decode('utf-8')), x) for x in line[3:]]):
    print "found character"

Whenever I run the script and give it that character, the console shows 쭌, which I'm guessing is because IDLE / Command Prompt can't display Korean characters.

쭌 is the last character that I was hoping to match in the regex.

So is the above search at least correct? I'd prefer to know that I have the right pattern before spending time trying to make the console show the proper Korean characters.

I've tried running chcp 1252 in Command Prompt, but nothing changed. The script never prints "found character", so I can't tell whether the pattern works.

If it helps, the script is receiving text from an IRC channel where Korean is usually spoken.

Upvotes: 2

Views: 5843

Answers (2)

Preston

Reputation: 8187

If you want to use the third-party regex library (not to be confused with the standard re module), you could do this:

import regex
regex.search(r'\p{IsHangul}', '오소리')

or in a function to detect at least one Hangul character:

import regex

def is_hangul(value):
    # True if value contains at least one Hangul character.
    return bool(regex.search(r'\p{IsHangul}', value))

print(is_hangul('오소리'))      # True
print(is_hangul('mushroom'))   # False
print(is_hangul('뱀'))         # True
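If installing the third-party regex module is not an option, a similar check can be sketched with only the standard library. This is a different technique from the \p{IsHangul} property above: it looks at each character's Unicode name via unicodedata.name and treats any name containing "HANGUL" as a match. The function name has_hangul is hypothetical, and the input is assumed to be already-decoded Unicode text.

```python
# -*- coding: utf-8 -*-
import unicodedata

def has_hangul(text):
    """Return True if any character's Unicode name mentions HANGUL.

    Assumes `text` is Unicode text (not raw bytes).
    """
    # unicodedata.name(ch, '') returns '' for unnamed characters
    # instead of raising ValueError.
    return any('HANGUL' in unicodedata.name(ch, '') for ch in text)

print(has_hangul(u'오소리'))     # True
print(has_hangul(u'mushroom'))  # False
```

This is slower than a compiled regex on long inputs, so it is probably best suited to short strings such as chat lines.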

Upvotes: 5

dlask

Reputation: 8992

Use Unicode strings (note the "u" prefixes):

import re

line = [u'x', u'y', u'z', u'쭌', u'a']

if any([re.search(u'[\u3131-\ucb4c]', x) for x in line[3:]]):
    print "found character"
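The original code most likely never matched because the pattern was Unicode while the list items were byte strings, so the regex was compared against individual UTF-8 bytes rather than code points. Since the text arrives from IRC as raw bytes, one option is to decode it before searching. The sketch below assumes the channel sends UTF-8 (not confirmed by the question); contains_hangul and HANGUL_RE are hypothetical names. It also narrows the character class to the Hangul Jamo, Hangul Compatibility Jamo, and Hangul Syllables blocks, because a single range from U+3131 to U+CB4C also covers the CJK ideographs in between.

```python
# -*- coding: utf-8 -*-
import re

# Hangul Jamo (U+1100-U+11FF), Hangul Compatibility Jamo (U+3130-U+318F)
# and Hangul Syllables (U+AC00-U+D7A3).
HANGUL_RE = re.compile(u'[\u1100-\u11ff\u3130-\u318f\uac00-\ud7a3]')

def contains_hangul(raw, encoding='utf-8'):
    """Return True if raw (bytes or Unicode text) contains Hangul.

    Assumption: incoming bytes use `encoding`; undecodable bytes are
    replaced rather than raising an error.
    """
    if isinstance(raw, bytes):
        raw = raw.decode(encoding, 'replace')
    return HANGUL_RE.search(raw) is not None

print(contains_hangul(b'\xec\xad\x8c'))  # UTF-8 bytes of U+CB4C -> True
print(contains_hangul(u'mushroom'))      # False
```

Decoding once at the point where the bytes are received keeps the rest of the script working with Unicode text only.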

Upvotes: 2
