Reputation: 99

Count number of unique characters in a string

I'm looking for a sql statement to count the number of unique characters in a string.

e.g.

3333333333 -> returns 1
1113333333 -> returns 2
1112222444 -> returns 3

I did some tests with REGEX and mysql-string-functions, but I didn't find a solution.

Upvotes: 8

Answers (7)

Sagenius

Reputation: 1

There isn't a single statement that I know of that will count the distinct characters in a string, but the coded lines below can achieve the desired outcome. The value below stored in @passedstr represent the string to be counted. The result of the query contains a number in the 'cnt' column that represents the number of distinct characters in the string. Examples below are written for MySQL and SQL Server. You could use this code idea in a function to have one statement that performs the count.

MySQL:

declare @passedstr as nvarchar(4000) = 'abbcdeffffff';

with recursive ccount (idx, pstr) AS (
    select 1, substring(@passedstr, 1, 1) union all
    select idx + 1, substring(@passedstr, idx + 1, 1) from ccount where idx < length(@passedstr))

select count(distinct (pstr)) cnt from ccount;

SQL Server:

declare @passedstr as nvarchar(4000) = 'abbcdeffffff';

with ccount as (
    select 1 idx, substring(@passedstr, 1, 1) pstr union all
    select idx + 1, substring(@passedstr, idx + 1, 1) from ccount where idx < len(@passedstr))

select count(distinct pstr) cnt from ccount

Upvotes: 0

Adge Cutler

Reputation: 34

There's a few levels of subquery which may put off some, and it would need to be extended for columns with longer strings, but by turning it on its side using UNPIVOT its quite straightforward.

declare @Data table (RowID nvarchar(1), StringData nvarchar(10))
insert into @Data values (N'1', N'3333333333'),(N'2', N'1113333333'),(N'3', N'1112222444')

select  t1.StringData, cast(t2.CharCount as nvarchar) as 'Unique Characters in String'
from    @Data t1
        inner join (
            select  RowID,count(*) as 'CharCount'
            from    (
                    select  distinct RowID, [char]
                    from    (
                        select  RowID,
                            substring(StringData,1,1) as '1',
                            substring(StringData,2,1) as '2',
                            substring(StringData,3,1) as '3',
                            substring(StringData,4,1) as '4',
                            substring(StringData,5,1) as '5',
                            substring(StringData,6,1) as '6',
                            substring(StringData,7,1) as '7',
                            substring(StringData,8,1) as '8',
                            substring(StringData,9,1) as '9',
                            substring(StringData,10,1) as '10'
                        from    @Data
                        ) Unpivd
                    unpivot ( [char] for chars in ([1],[2],[3],[4],[5],[6],[7],[8],[9],[10])) unpiv
                    where [char] <> ''
                ) CharCounter
            group by RowID
            ) t2
            on t2.RowID = t1.RowID

This returns:

StringData  Unique Characters in String
3333333333  1
1113333333  2
1112222444  3

Upvotes: 0

Ryan Vincent

Reputation: 4513

This is for fun right?

SQL is all about processing sets of rows, so if we can convert a 'word' into a set of characters as rows then we can use the 'group' functions to do useful stuff.

Using a 'relational database engine' to do simple character manipulation feels wrong. Still, is it possible to answer your question with just SQL? Yes it is...

Now, i always have a table that has one integer column that has about 500 rows in it that has the ascending sequence 1 .. 500. It is called 'integerseries'. It is a really small table that used a lot so it gets cached in memory. It is designed to replace the from 'select 1 ... union ... text in queries.

It is useful for generating sequential rows (a table) of anything that you can calculate that is based on a integer by using it in a cross join (also any inner join). I use it for generating days for a year, parsing comma delimited strings etc.

Now, the sql mid function can be used to return the character at a given position. By using the 'integerseries' table i can 'easily' convert a 'word' into a characters table with one row per character. Then use the 'group' functions...

SET @word='Hello World';

SELECT charAtIdx, COUNT(charAtIdx)
FROM (SELECT charIdx.id,
    MID(@word, charIdx.id, 1) AS charAtIdx 
    FROM integerseries AS charIdx
    WHERE charIdx.id <= LENGTH(@word)
    ORDER BY charIdx.id ASC
    ) wordLetters
GROUP BY
   wordLetters.charAtIdx
ORDER BY charAtIdx ASC

Output:

charAtIdx  count(charAtIdx)  
---------  ------------------
                            1
d                           1
e                           1
H                           1
l                           3
o                           2
r                           1
W                           1

Note: The number of rows in the output is the number of different characters in the string. So, if the number of output rows is counted then the number of 'different letters' will be known.

This observation is used in the final query.

The final query:

The interesting point here is to move the 'integerseries' 'cross join' restrictions (1 .. length(word)) into the actual 'join' rather than do it in the where clause. This provides the optimizer with clues as to how to restrict the data produced when doing the join.

SELECT 
   wordLetterCounts.wordId,
   wordLetterCounts.word,   
   COUNT(wordLetterCounts.wordId) AS letterCount
FROM 
     (SELECT words.id AS wordId,
             words.word AS word,
             iseq.id AS charPos,
             MID(words.word, iseq.id, 1) AS charAtPos,
             COUNT(MID(words.word, iseq.id, 1)) AS charAtPosCount
     FROM
          words
          JOIN integerseries AS iseq
               ON iseq.id BETWEEN 1 AND words.wordlen 
      GROUP BY
            words.id,
            MID(words.word, iseq.id, 1)
      ) AS wordLetterCounts
GROUP BY
   wordLetterCounts.wordId

Output:

wordId  word                  letterCount  
------  --------------------  -------------
     1  3333333333                        1
     2  1113333333                        2
     3  1112222444                        3
     4  Hello World                       8
     5  funny - not so much?             13

Word Table and Data:

CREATE TABLE `words` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(128) COLLATE utf8mb4_unicode_ci NOT NULL,
  `wordlen` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

/*Data for the table `words` */

insert  into `words`(`id`,`word`,`wordlen`) values (1,'3333333333',10);
insert  into `words`(`id`,`word`,`wordlen`) values (2,'1113333333',10);
insert  into `words`(`id`,`word`,`wordlen`) values (3,'1112222444',10);
insert  into `words`(`id`,`word`,`wordlen`) values (4,'Hello World',11);
insert  into `words`(`id`,`word`,`wordlen`) values (5,'funny - not so much?',20);

Integerseries table: range 1 .. 30 for this example.

CREATE TABLE `integerseries` (
  `id` int(11) unsigned NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=500 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Upvotes: 8

pushOk

Reputation: 290

DROP FUNCTION IF EXISTS test.count_chrs;
CREATE DEFINER=`test`@`localhost` FUNCTION `count_chrs`(s CHAR(100)) RETURNS CHAR(4)
  BEGIN 
    DECLARE string_length int(4);
    DECLARE unique_string CHAR(100) DEFAULT "";
    DECLARE count_unique int(4) DEFAULT 0;
    DECLARE current_char int(4) DEFAULT 1;
    SET string_length = CHAR_LENGTH(s);

    WHILE current_char <= string_length DO
      IF (!LOCATE(SUBSTR(s, current_char, 1), unique_string)) THEN
        SET count_unique = count_unique + 1;
        SET unique_string = CONCAT(unique_string, SUBSTR(s, current_char, 1));
      END IF;

      SET current_char = current_char + 1;
    END WHILE;

    RETURN count_unique; 
  END;

I'm very newbie with MySQL functions declaration, but this could turn you to the right way.

Upvotes: 0

pala_

Reputation: 9010

One thing you can do is have a table of all your characters, such as:

mysql> select * from chars;
+----+------+
| id | c    |
+----+------+
|  1 | 1    |
|  2 | 2    |
|  3 | 3    |
|  4 | 4    |
+----+------+

If your table of words looks like this:

mysql> select * from words;
+----+-----------+
| id | word      |
+----+-----------+
|  1 | 111222333 |
|  2 | 11111111  |
|  3 | 2222111   |
|  4 | 5555555   |
+----+-----------+

You can then join these tables on the condition of the character being in the word, and get the count, like this:

mysql> select word, count(c) from words w inner join chars c on locate(c.c, word) group by word;
+-----------+----------+
| word      | count(c) |
+-----------+----------+
| 11111111  |        1 |
| 111222333 |        3 |
| 2222111   |        2 |
+-----------+----------+

Upvotes: 3

Abhik Chakraborty

Reputation: 44874

There is no direct or easy way of doing it. You may need to write a store function to do the job and by looking at all the characters you may expect in the data. Here is an example for just digits , which could be extended for all the characters in a stored function

mysql> select * from test ;
+------------+
| val        |
+------------+
| 11111111   |
| 111222222  |
| 1113333222 |
+------------+


select 
val, 
sum(case when locate('1',val) > 0 then 1 else 0 end ) 
+ sum( case when locate('2',val) > 0 then 1 else 0 end)
+ sum(case when locate('3',val) > 0 then 1 else 0 end)
+sum(case when locate('4',val) > 0 then 1 else 0 end ) as occurence
from test group by val


+------------+-----------+
| val        | occurence |
+------------+-----------+
| 11111111   |         1 |
| 111222222  |         2 |
| 1113333222 |         3 |
+------------+-----------+

Or if you have enough time , create a lookup table with all the characters you could think of. And make the query in 2 lines

mysql> select * from test ;
+------------+
| val        |
+------------+
| 11111111   |
| 111222222  |
| 1113333222 |
+------------+
3 rows in set (0.00 sec)

mysql> select * from look_up ;
+------+------+
| id   | val  |
+------+------+
|    1 | 1    |
|    2 | 2    |
|    3 | 3    |
|    4 | 4    |
+------+------+
4 rows in set (0.00 sec)

select 
t1.val, 
sum(case when locate(t2.val,t1.val) > 0 then 1 else 0 end ) as occ 
from test t1,(select * from look_up)t2 
group by t1.val ;

+------------+------+
| val        | occ  |
+------------+------+
| 11111111   |    1 |
| 111222222  |    2 |
| 1113333222 |    3 |
+------------+------+

Upvotes: 4

Marcelo

Reputation: 429

I think this is not a job for Mysql, but you could do anything if you try hard enough ;)

I don't like this answers but it works, and it's not too ugly if you have only numbers

SELECT 
    CASE WHEN yourcolumn LIKE '%1%' THEN 1 ELSE 0 END + 
    CASE WHEN yourcolumn LIKE '%2%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%3%' THEN 1 ELSE 0 END + 
    CASE WHEN yourcolumn LIKE '%4%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%5%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%6%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%7%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%8%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%9%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%0%' THEN 1 ELSE 0 END
FROM yourtable

Upvotes: 0

Count number of unique characters in a string

Answers (7)

Related Questions