Reputation: 6442
I want to remove all non-ASCII characters except the Unicode emoticons from a text file. I am using the following command, which removes all non-ASCII characters:
perl -i.bak -pe 's/[^[:ascii:]]//g'
Can this command be modified so that it keeps emoticon characters?
EDIT:
Sample input: Good morning! #Happy #StPatricksDay ♣♥😊
Sample output: Good morning! #Happy #StPatricksDay 😊
Upvotes: 2
Views: 2739
Reputation: 3093
Just extend the set of characters that are excluded from removal so that it also covers the emoticons:
perl -i.bak -pe 's/[^[:ascii:]\p{block:Emoticons}\N{U+2639}\N{U+263A}\N{U+263B}]//g'
After a lot of experimenting with different switches, I found a combination that works with the \p{block...} and \N{U+xxxx} style of regex escapes. The key is the -CS switch, which makes Perl treat STDIN and STDOUT as UTF-8:
perl -CS -pe 's/[^[:ascii:]\p{block:emoticons}\N{U+2639}-\N{U+263B}]//g'
Do note that your text has to be encoded in UTF-8 for this to work (at least on my Cygwin setup).
Upvotes: 0
Reputation: 785481
You can specify a range in Perl like this:
s='Good morning! #Happy #StPatricksDay ♣♥😊'
echo "$s" | perl -C -pe 's/[^[:ascii:]\x{1F600}-\x{1F64F}]+//g'
Good morning! #Happy #StPatricksDay 😊
Reference: Unicode block for emoticons
Upvotes: 3