mrbolichi

Reputation: 627

Regex to delete emojis from string

I have a list of the Unicode emojis and I want to strip the emojis from it (i.e. I just want the code points at the start of each row and the name at the end). Sample rows look like this:

1F468 1F3FD 200D 2695 FE0F   ; fully-qualified # 👨🏽‍⚕️ man health worker: medium skin tone
1F469 1F3FF 200D 2695        ; non-fully-qualified # 👩🏿‍⚕ woman health worker: dark skin tone

(I have deleted some spaces from these rows for the sake of simplicity.) What I want is to match the [non-]fully-qualified part as well as the # and the emoji, so I can delete them with sed. I have tried the following regex:

 sed -e 's/\<[on-]*fully-qualified\># *.+?(?=[a-zA-Z]) //g' 

which tries to match the words [non-]fully-qualified, a space, the # symbol, and then whatever follows (non-greedily) up to the first letter, and replace it all with an empty string.

I would like to have this output:

1F468 1F3FD 200D 2695 FE0F   ; man health worker: medium skin tone
1F469 1F3FF 200D 2695        ; woman health worker: dark skin tone

I have tried several posted answers to no avail. Besides, I'm trying to match a pattern between two boundaries, which is where I'm having trouble.

EDIT: I'm trying to run the command in the Git Bash shipped with Git for Windows.

Upvotes: 0

Views: 2056

Answers (2)

MauricioRobayo

Reputation: 2356

I'm not entirely sure, but this might work:

sed 's/;.*fully-qualified\s*#[^a-zA-Z]*/; /'

This matches a semicolon ;, followed by any run of characters .*, followed by the text "fully-qualified", followed by any amount of whitespace \s*, followed by a hash sign #, followed by any run of characters that are not ASCII letters [^a-zA-Z]*, and replaces the whole match with a semicolon followed by a space.

If [a-zA-Z] is matching characters other than a to z and A to Z, which seems to be the problem here, a quick fix for this one command is to run it with LC_ALL=C:

LC_ALL=C sed 's/;.*fully-qualified\s*#[^a-zA-Z]*/; /' file
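
As a quick sanity check, here is a minimal sketch that runs the same substitution over the two sample rows from the question. The function name strip_qualifier is made up for illustration, and [[:space:]] is swapped in for \s on the assumption that your sed may not support that GNU extension. Note that sed's regex dialects have no lookahead (?=...) and no lazy quantifiers like .+?, which is likely why the attempt in the question never matches.

```shell
# Hypothetical helper: reads rows on stdin, drops "[non-]fully-qualified # <emoji> ".
# LC_ALL=C makes sed work byte-by-byte, so [^a-zA-Z] also covers the emoji bytes.
strip_qualifier() {
  LC_ALL=C sed 's/;.*fully-qualified[[:space:]]*#[^a-zA-Z]*/; /'
}

# The two sample rows from the question.
printf '%s\n' \
  '1F468 1F3FD 200D 2695 FE0F   ; fully-qualified # 👨🏽‍⚕️ man health worker: medium skin tone' \
  '1F469 1F3FF 200D 2695        ; non-fully-qualified # 👩🏿‍⚕ woman health worker: dark skin tone' |
  strip_qualifier
```

Since everything up to the first ASCII letter is removed, a name that starts with a digit (such as "1st place medal") would lose its leading digit with this pattern.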

Upvotes: 1

Charles Srstka

Reputation: 17050

I like to search for what I actually want and then keep it.

This works on OS X in my testing:

sed -E 's/^([^#]+)#[^a-zA-Z\s]*(.*)$/\1 # \2/g'

EDIT: I don't have the Windows version of sed to try, but maybe this will work. Not as precise, but short and simple.

sed -e 's/#\s*[^a-zA-Z\s]*/# /g'

EDIT AGAIN: My bad, I read the question again and you wanted to delete more than just the emoji. This one should do it.

sed -e 's/;[^#]*#\s*[^a-zA-Z\s]*/; /g'
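
In the same keep-what-you-want spirit, here is a hedged sketch using capture groups. It is a variation, not the command above: it assumes the emoji is a single token with no whitespace inside it, and skips that token rather than skipping "everything that is not a letter". The function name keep_codes_and_name is invented for the example, and it uses POSIX character classes instead of \s in case your sed lacks that extension.

```shell
# Hypothetical helper: keep the code points up to ";" and the name after the emoji.
#   \1 = everything up to and including the first ";"
#   \2 = everything after the emoji token and the whitespace that follows it
keep_codes_and_name() {
  LC_ALL=C sed -E 's/^([^;]*;)[^#]*#[[:space:]]*[^[:space:]]+[[:space:]]+(.*)$/\1 \2/'
}

printf '%s\n' \
  '1F468 1F3FD 200D 2695 FE0F   ; fully-qualified # 👨🏽‍⚕️ man health worker: medium skin tone' \
  '1F947 ; fully-qualified # 🥇 1st place medal' |
  keep_codes_and_name
```

Because the emoji is skipped as a whitespace-free token, names that begin with a digit stay intact. Lines that do not fit the pattern (comments, blank lines) pass through unchanged.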

Upvotes: 1
