tktktk0711

Reputation: 1694

python: extract the emoticon text from Japanese twitter text with regex

I am using the following regex to extract emoticon text from Japanese Twitter text with Python.

import re

# this matches digits, Latin letters, and Japanese characters
text2 = r'[0-9A-Za-zぁ-んァ-ン一-龥]'  

non_text = r'[^0-9A-Za-zぁ-んァ-ン一-龥]'
# these characters are allowed in Japanese emoticons
allow_text = r'[ovっつ゜ニノ三二]'
hw_kana = r'[ヲ-゚]'
open_branket = r'[\(∩ (]'
close_branket = r'[\)∩ )]'
arround_face = r'(?:' + non_text + '|' + allow_text + ')*'
face = r'(?!(?:' + text2 + '|' + hw_kana + '){3,}).{3,}'
face_string = (arround_face + open_branket + face + close_branket +
               arround_face)
p_face = re.compile(face_string)

string1 = 'ふう。お腹いっぱい( ´•౪•`), 試験頑張るぞ\\\\ ٩( ‘ω’ )و ////'
string2 = '心の相談は メール [email protected] までご連絡ください'
string3 = 'ドーピング系浪人生(n=1)'
string4 = '横浜は関内にある「 BAY らっきょ 」に初訪問してまいりました関東スープカレーブームの火付け役となったお店の「 人気NO.1 チキンカレー 」をいただきました(´∀`人)'
string5 = '鳥取県倉吉市   倉吉農業高校  3年食品科 (音楽部・茶道部)    AKB48大ファン高校生!まゆゆ、中野郁海ちゃん神推し    m0326w。♥。・゚♡゚・。♥。i0820n~現在♥大好きだよ♥       AKBファンの方はフォローお願いします^-^  \n\n来春から新社会人・・・の予定(´・ω・`)   '
string6 = 'うわ。。(-_-;)授業。運動会はなくなると?'
string7 = '毎月泊まっちゃえ♡親孝行*\(^o^)/*でも出来る時しとかないとだよ(o^^o)'

emoj1 = p_face.findall(string1)
emoj2 = p_face.findall(string2)
emoj3 = p_face.findall(string3)
emoj4 = p_face.findall(string4)
emoj5 = p_face.findall(string5)
emoj6 = p_face.findall(string6)
emoj7 = p_face.findall(string7)


print(emoj1)
print(emoj2)
print(emoj3)
print(emoj4)
print(emoj5)
print(emoj6)
print(emoj7)

But the result is the following:

1.  ['( ´•౪•`), 試験頑張るぞ\\\\ ٩( ‘ω’ )و']
2.  ['\u3000メール\u3000']
3.  ['(n=1)']
4.  ['「\u3000BAY\u3000'] 
5.  ['(´・ω・`)   ']
6.  ['。。(-_-;)']

There are some issues. First, string1 actually contains two emoticons:

    ( ´•౪•`) and \\\\ ٩( ‘ω’ )و ////

but the result shows a single match that joins the two emoticons together with the Japanese text between them. I want the following list, containing the two emoticons separately:

[ '( ´•౪•`)',' \\\\ ٩( ‘ω’ )و ////']

Secondly, in string5, ♥。・゚♡゚・。♥。 and ^-^ are also emoticons, but they can't be extracted by this regex.

In addition, string2, string3 and string4 contain no emoticon text ( メール , (n=1) and ['「 BAY '] are not emoticons), but the regex pattern extracted these texts anyway. Could you give me a hand with solving this? Thanks! For reference, see this list of Japanese emoticons: http://kaomojiya.com/kao/?other/line

Upvotes: 1

Views: 913

Answers (1)

Thomas W.

Reputation: 450

The following regular expression should match what you want:

expr = (
    r'[^0-9A-Za-zぁ-んァ-ン一-龥ovっつ゜ニノ三二]*'   # [1]
    r'[\(∩ (]'                                  # [2]
    r'[^0-9A-Za-zぁ-んァ-ン一-龥ヲ-゚\)∩ )]*'       # [3]
    r'[\)∩ )]'                                  # [4]
    r'[^0-9A-Za-zぁ-んァ-ン一-龥ovっつ゜ニノ三二]*'   # [5]
)

You can try it here.

It starts by matching potential special characters (anything except digits, romaji, hiragana, katakana and kanji, plus a few special kana) [1], as you do. Then it matches what you called open_branket [2], followed by any run of characters that are not kanji, digits, etc. and not a close_branket [3]. Finally, it matches the end of the emoji the same way you do, with [4] and [5].


EDIT

string4 = ...
string5 = ...

The problem with string4 is that the characters ＢＡＹ differ from BAY. The second ones are the usual ASCII characters 0x42, 0x41 and 0x59, while the first ones are fullwidth Unicode characters in the range 0xff21 to 0xff3a. You can just add that range to the list of rejected characters ([3]). You might also want to add the lowercase versions, from 0xff41 to 0xff5a, as well as the corresponding digits, from 0xff10 to 0xff19. You might be interested in reading this page about fullwidth and halfwidth forms.
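To make the difference concrete, here is a small sketch that compares the code points of the two spellings. The last step uses `unicodedata.normalize` with NFKC, which is not part of the approach above; it is shown only as a possible alternative, and it may also rewrite characters that appear inside emoticons, so treat it as an assumption to verify on your data.

```python
import unicodedata

# Written with escapes so the fullwidth letters cannot be confused with ASCII.
fullwidth = "\uff22\uff21\uff39"   # fullwidth B, A, Y
ascii_text = "BAY"

# The two strings look alike but are different code points.
print(fullwidth == ascii_text)
print([hex(ord(ch)) for ch in fullwidth])
print([hex(ord(ch)) for ch in ascii_text])


def is_fullwidth_alnum(ch):
    """True for fullwidth A-Z, a-z and 0-9 (the ranges named in the text)."""
    cp = ord(ch)
    return (0xFF21 <= cp <= 0xFF3A) or (0xFF41 <= cp <= 0xFF5A) or (0xFF10 <= cp <= 0xFF19)


print(all(is_fullwidth_alnum(ch) for ch in fullwidth))

# Alternative (assumption, not from the answer): fold fullwidth forms to ASCII.
normalized = unicodedata.normalize("NFKC", fullwidth)
print(normalized == ascii_text)
```

Adding the three ranges to the rejected set keeps the original text untouched, whereas normalizing first changes the input before matching.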

The problem with string5 is that those emoji do not contain any open/close character as you defined them. For the first emoji, this can be solved by adding to the list of opening characters, if this is acceptable. However, that doesn't solve the problem of ^-^.

I would suggest changing the strategy. Something that seems to work reasonably well is to choose a set of common characters that appear in ordinary text (let's call it C), a subset of C that might also appear in emojis (let's call it S), and a number x. Then you can build the following regular expression:

(?:C*)(?P<match>(?:[^C]|S){x,})(?:C*)

This expression matches "regular" text in a non-capturing group, followed by a sequence of at least x characters that are either "non-regular" or belong to your subset S, captured in a group named match, followed by any further "regular" text, which is not captured.
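As an illustration of the template only, here is a minimal Python sketch with a deliberately tiny, made-up C (lowercase ASCII letters and the space) and S (the letter o). The helper names `build_pattern` and `extract` are hypothetical and are not part of the original expression.

```python
import re


def build_pattern(common, allowed, min_len):
    """Fill the template (?:C*)(?P<match>(?:[^C]|S){x,})(?:C*).

    `common` and `allowed` are the insides of character classes.
    """
    template = "(?:[{c}]*)(?P<match>(?:[^{c}]|[{s}]){{{x},}})(?:[{c}]*)"
    return re.compile(template.format(c=common, s=allowed, x=min_len))


def extract(pattern, text):
    """Return the text captured by the named group for every match."""
    return [m.group("match") for m in pattern.finditer(text)]


# Toy sets chosen for the example (assumptions, not the real C and S).
toy = build_pattern(common="a-z ", allowed="o", min_len=3)

print(extract(toy, "hello (^o^) world"))
print(extract(toy, "good"))
```

In the first string the `o` inside the face is kept because it belongs to S, while the letters around the face are consumed by the non-capturing groups. The second string has no run of three qualifying characters, so nothing is captured.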

Investigating the Unicode table, I defined C as the following set:

\u4e00-\u9fff      => CJK Unified Ideographs
\u3400-\u4dbf      => CJK Unified Ideographs Extension A
\uf900-\ufaff      => CJK Compatibility Ideographs
\u3040-\u309f      => Hiragana
\u30a0-\u30ff      => Katakana
\u3000-\u303f      => "CJK Symbol and punctuation"
\uff21-\uff3a      => fullwidth A to Z
\uff41-\uff5a      => fullwidth a to z
\uff10-\uff19      => Fullwidth 0 to 9
\uff00-\uff0e      => Fullwidth form of some punctuation characters
A-Z                => ASCII A to Z
a-z                => ASCII a to z
0-9                => ASCII numbers
@.,;!\?  ~♥\     => other punctuation characters

I defined S as [人・;皿。゜°うぅ] and set x to 3, but you will need to inspect the set of Japanese emoji more closely to refine these choices.


This leads to the following regular expression:

(?:[\u4E00-\u9FFF\u3400-\u4DBF\uF900-\ufaff\u3040-\u309f\u30a0-\u30ff\u3000-\u303f\uff21-\uff3a\uff41-\uff5a\uff10-\uff19\uff00-\uff0eA-Za-z0-9@.,;!\?  ~♥\\]*)(?P<match>(?:[^\u4E00-\u9FFF\u3400-\u4DBF\uF900-\ufaff\u3040-\u309f\u30a0-\u30ff\u3000-\u303f\uff21-\uff3a\uff41-\uff5a\uff10-\uff19A-Za-z0-9\r\n]|[人・;皿。゜°うぅ]){3,})(?:[\u4E00-\u9FFF\u3400-\u4DBF\uF900-\ufaff\u3040-\u309f\u30a0-\u30ff\u3000-\u303f\uff21-\uff3a\uff41-\uff5a\uff10-\uff19A-Za-z0-9@.,;  ~♥\\]*)
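For use from Python, the expression can be assembled from named pieces instead of being pasted as one long literal. The sketch below is a simplified reconstruction under stated assumptions: the names `BASE`, `SURROUND`, `CORE`, `ALLOWED` and `extract_faces` are hypothetical, the punctuation list in `SURROUND` is an approximation of the one above, and the same surrounding class is used on both sides of the captured group.

```python
import re

# Ranges taken from the list of "common" characters (C).
BASE = (
    r"\u4e00-\u9fff"   # CJK Unified Ideographs
    r"\u3400-\u4dbf"   # CJK Unified Ideographs Extension A
    r"\uf900-\ufaff"   # CJK Compatibility Ideographs
    r"\u3040-\u309f"   # Hiragana
    r"\u30a0-\u30ff"   # Katakana
    r"\u3000-\u303f"   # CJK symbols and punctuation
    r"\uff21-\uff3a"   # fullwidth A to Z
    r"\uff41-\uff5a"   # fullwidth a to z
    r"\uff10-\uff19"   # fullwidth 0 to 9
    r"A-Za-z0-9"       # ASCII letters and digits
)
# Surrounding text may also contain some punctuation (approximate list).
SURROUND = BASE + r"\uff00-\uff0e@.,;!?~\u2665 \\"
# Characters that can never be part of a face, unless listed in ALLOWED.
CORE = BASE + r"\r\n"
ALLOWED = "人・;皿。゜°うぅ"   # the subset S
MIN_LEN = 3                    # the number x

FACE = re.compile(
    "(?:[" + SURROUND + "]*)"
    "(?P<match>(?:[^" + CORE + "]|[" + ALLOWED + "]){" + str(MIN_LEN) + ",})"
    "(?:[" + SURROUND + "]*)"
)


def extract_faces(text):
    """Return every candidate emoticon captured by the named group."""
    return [m.group("match") for m in FACE.finditer(text)]


print(extract_faces("うわ。。(-_-;)授業"))
print(extract_faces("ドーピング系浪人生(n=1)"))
```

With these sets the first string should yield only the face, because the leading 。。 is consumed by the surrounding group, and the second string should yield nothing, because (n=1) never offers three qualifying characters in a row.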

In conclusion, I would say it's not really possible to match every Japanese emoji with one regular expression, as they don't follow any well-defined pattern. Moreover, they seem to include, and sometimes end with, regular text. For example, (。´-д-)疲れた。。 taken from your link. Another solution, such as a database of emojis, might be worth investigating.
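A rough sketch of the database idea: keep a list of known emoticons and search for them literally. The list `KNOWN_FACES` below is a tiny hypothetical sample, not a real database, and the helper names are made up for illustration.

```python
import re

# Hypothetical sample lexicon; a real one would be loaded from a file or site.
KNOWN_FACES = ["(-_-;)", "(o^^o)", "^-^"]


def build_lexicon_pattern(entries):
    """Compile one alternation of literal entries, longest first.

    Sorting by length makes a longer emoticon win over a shorter one
    that happens to be its prefix.
    """
    ordered = sorted(entries, key=len, reverse=True)
    return re.compile("|".join(re.escape(entry) for entry in ordered))


def find_known_faces(text, pattern):
    """Return every known emoticon found in text, in order of appearance."""
    return pattern.findall(text)


lexicon = build_lexicon_pattern(KNOWN_FACES)
print(find_known_faces("うわ。。(-_-;)授業", lexicon))
print(find_known_faces("AKB^-^ (o^^o)", lexicon))
print(find_known_faces("ドーピング系浪人生(n=1)", lexicon))
```

This trades recall for precision: it never reports ordinary text such as (n=1), but it only finds emoticons that are already in the list.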

Upvotes: 2
