TechnoSam
TechnoSam

Reputation: 578

PostgreSQL pattern matching with Unicode graphemes

Is there any way to pattern match with Unicode graphemes?

As a quick example, when I run this query:

CREATE TABLE test (
    id SERIAL NOT NULL, 
    name VARCHAR NOT NULL,
    PRIMARY KEY (id), 
    UNIQUE (name)
);
INSERT INTO test (name) VALUES ('πŸ‘πŸ» One');
INSERT INTO test (name) VALUES ('πŸ‘ Two');

SELECT * FROM public.test WHERE test.name LIKE 'πŸ‘%';

I get both rows returned, rather than just 'πŸ‘ Two'. Postgres seems to be just comparing code points, but I want it to compare full graphemes, so it should only match 'πŸ‘ Two', because πŸ‘πŸ» is a different grapheme.

Is this possible?

Upvotes: 4

Views: 310

Answers (1)

S-Man
S-Man

Reputation: 23676

It's a very interesting question!

I am not quite sure if it is possible anyway:

The skinned emojis are, in fact, two joined characters (like ligatures). The first character is the yellow hand πŸ‘ which is followed by an emoji skin modifier 🏻

This is how the light skinned hand is stored internally. So, for me, your result makes sense:

When you query any string, that begins with πŸ‘, it will return:

  1. πŸ‘ Two (trivial)
  2. πŸ‘_🏻 One (ignore the underscore, I try to suppress the automated ligature with this)

So, you can see, the light skinned emoji internally also starts with πŸ‘. That's why I believe, that your query doesn't work the way you like.

Workarounds/Solutions:

  1. You can add a space to your query. This ensures, that there's no skin modifier after your πŸ‘ character. Naturally, this only works in your case, where all data sets have a space after the hand:

    SELECT * FROM test WHERE name LIKE 'πŸ‘ %';
    
  2. You can simply extend the WHERE clause like this:

    SELECT * FROM test 
    WHERE name LIKE 'πŸ‘%'
        AND name NOT LIKE 'πŸ‘πŸ»%'
        AND name NOT LIKE 'πŸ‘πŸΌ%'
        AND name NOT LIKE 'πŸ‘πŸ½%'
        AND name NOT LIKE 'πŸ‘πŸΎ%'
        AND name NOT LIKE 'πŸ‘πŸΏ%'
    
  3. You can use regular expression pattern matching to exclude the skins:

    SELECT * FROM test 
    WHERE name  ~ '^πŸ‘[^🏻🏼🏽🏾🏿]*$'
    

see demo:db<>fiddle (note that the fiddle seems not to provide automated ligatures, so both characters are separated displayed there)

Upvotes: 3

Related Questions