Reputation: 627
I have a list of the Unicode emojis and I want to strip the emojis from it (i.e. I just want the whole first part and the name at the end of each row). Sample rows look like these:
1F468 1F3FD 200D 2695 FE0F ; fully-qualified # 👨🏽⚕️ man health worker: medium skin tone
1F469 1F3FF 200D 2695 ; non-fully-qualified # 👩🏿⚕ woman health worker: dark skin tone
(from which I have deleted some spaces for the sake of simplicity). What I want is to match the [non-]fully-qualified part as well as the # and the emoji, so I can delete them with sed. I have tried the following regex:
sed -e 's/\<[on-]*fully-qualified\># *.+?(?=[a-zA-Z]) //g'
which tries to match the words [non-]fully-qualified, a space, the # symbol, and then whatever it can find (non-greedy) until the first letter, and replace it with an empty string.
I would like to have this output:
1F468 1F3FD 200D 2695 FE0F ; man health worker: medium skin tone
1F469 1F3FF 200D 2695 ; woman health worker: dark skin tone
I have tried several posted answers to no avail. Besides, I'm trying to match a pattern between two boundaries, which is where I'm having trouble.
EDIT: I'm trying to run the command in the Git Bash shipped with Git for Windows.
Upvotes: 0
Views: 2056
Reputation: 2356
I'm still not entirely sure, but this might work:
sed 's/;.*fully-qualified\s*#[^a-zA-Z]*/; /'
This will replace a semicolon ;, followed by any run of characters .*, followed by the "fully-qualified" text, followed by any amount of whitespace \s*, followed by a hash sign #, followed by any run of characters that are not letters [^a-zA-Z]*, and replace all that with a semicolon followed by a space.
To be sure that [a-zA-Z] matches only a to z and A to Z and no other characters, which seems to be the problem, a quick fix just for that command could be to use LC_ALL=C:
LC_ALL=C sed 's/;.*fully-qualified\s*#[^a-zA-Z]*/; /' file
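For a quick check without the real file, here is a small sketch that feeds the two sample rows from the question through the same command. The function name strip_emoji and the variables sample and result are just illustrative, and \s is a GNU sed extension, so this assumes GNU sed (which I believe is what Git Bash ships):

```shell
# Hypothetical helper wrapping the command above; reads stdin, writes stdout.
strip_emoji() {
  LC_ALL=C sed 's/;.*fully-qualified\s*#[^a-zA-Z]*/; /'
}

# The two sample rows from the question.
sample='1F468 1F3FD 200D 2695 FE0F ; fully-qualified # 👨🏽⚕️ man health worker: medium skin tone
1F469 1F3FF 200D 2695 ; non-fully-qualified # 👩🏿⚕ woman health worker: dark skin tone'

result=$(printf '%s\n' "$sample" | strip_emoji)
printf '%s\n' "$result"
```

With LC_ALL=C the emoji are treated as plain bytes, none of which fall in a-zA-Z, so [^a-zA-Z]* should consume them together with the surrounding spaces.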
Upvotes: 1
Reputation: 17050
I like to search for what I actually want and then keep it.
This works on OS X in my testing:
sed -E 's/^([^#]+)#[^a-zA-Z\s]*(.*)$/\1 # \2/g'
EDIT: I don't have the Windows version of sed to try, but maybe this will work. Not as precise, but short and simple.
sed -e 's/#\s*[^a-zA-Z\s]*/# /g'
EDIT AGAIN: My bad, I read the question again and you wanted to delete more than just the emoji. This one should do it.
sed -e 's/;[^#]*#\s*[^a-zA-Z\s]*/; /g'
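If \s turns out to be the part that a given sed does not understand, a variant without it might be worth trying. This is only a sketch under the assumption that every name starts with an ASCII letter: since [^a-zA-Z]* already consumes spaces, the \s pieces can be dropped and what is left is plain POSIX basic regex syntax. The names strip_emoji_portable, rows and portable_result are made up for the example:

```shell
# Hypothetical helper: same idea as the last command, but without \s.
# LC_ALL=C makes a-z and A-Z mean exactly the ASCII letters.
strip_emoji_portable() {
  LC_ALL=C sed 's/;[^#]*#[^a-zA-Z]*/; /'
}

# The two sample rows from the question.
rows='1F468 1F3FD 200D 2695 FE0F ; fully-qualified # 👨🏽⚕️ man health worker: medium skin tone
1F469 1F3FF 200D 2695 ; non-fully-qualified # 👩🏿⚕ woman health worker: dark skin tone'

portable_result=$(printf '%s\n' "$rows" | strip_emoji_portable)
printf '%s\n' "$portable_result"
```

One caveat: a name that begins with a digit or punctuation would lose those leading characters, because everything up to the first letter after the # is removed.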
Upvotes: 1