Ravindra S

Reputation: 6442

Perl: Remove all non-ascii characters except specific ones

I want to remove all non-ASCII characters except the Unicode emoticons from a text file. I am using the following command, which removes all non-ASCII characters:

perl -i.bak -pe 's/[^[:ascii:]]//g'

Can this command be modified so that it keeps the emoticon characters?

EDIT:

Sample input: Good morning! #Happy #StPatricksDay ♣♥😊

Sample output: Good morning! #Happy #StPatricksDay 😊

Upvotes: 2

Views: 2739

Answers (2)

Peter Bowers

Reputation: 3093

Just extend the negated character class so that it also keeps the emoticons:

perl -i.bak -pe 's/[^[:ascii:]\p{block:Emoticons}\N{U+2639}\N{U+263A}\N{U+263B}]//g'

Edit

After a lot of messing around and trying different switches, I found a combination that works with the \p{block...} and \N{U+xxxx} type of regexes: the -CS switch, which makes Perl treat the standard streams as UTF-8.

perl -CS -pe 's/[^[:ascii:]\p{block:emoticons}\N{U+2639}-\N{U+263B}]//g'

Do note that your text has to be in UTF-8 for this to work (at least on my cygwin setup).

Upvotes: 0

anubhava

Reputation: 785481

You can specify a code point range in Perl like this:

s='Good morning! #Happy #StPatricksDay ♣♥😊'

echo "$s" | perl -C -pe 's/[^[:ascii:]\x{1F600}-\x{1F64F}]+//g'
Good morning! #Happy #StPatricksDay 😊

Reference: Unicode block for emoticons

Upvotes: 3
