Reputation: 99
I'm looking for a sql statement to count the number of unique characters in a string.
e.g.
3333333333 -> returns 1
1113333333 -> returns 2
1112222444 -> returns 3
I did some tests with REGEX and mysql-string-functions, but I didn't find a solution.
Upvotes: 8
Views: 6823
Reputation: 1
There isn't a single statement that I know of that will count the distinct characters in a string, but the coded lines below can achieve the desired outcome. The value below stored in @passedstr represent the string to be counted. The result of the query contains a number in the 'cnt' column that represents the number of distinct characters in the string. Examples below are written for MySQL and SQL Server. You could use this code idea in a function to have one statement that performs the count.
MySQL:
declare @passedstr as nvarchar(4000) = 'abbcdeffffff';
with recursive ccount (idx, pstr) AS (
select 1, substring(@passedstr, 1, 1) union all
select idx + 1, substring(@passedstr, idx + 1, 1) from ccount where idx < length(@passedstr))
select count(distinct (pstr)) cnt from ccount;
SQL Server:
declare @passedstr as nvarchar(4000) = 'abbcdeffffff';
with ccount as (
select 1 idx, substring(@passedstr, 1, 1) pstr union all
select idx + 1, substring(@passedstr, idx + 1, 1) from ccount where idx < len(@passedstr))
select count(distinct pstr) cnt from ccount
Upvotes: 0
Reputation: 34
There's a few levels of subquery which may put off some, and it would need to be extended for columns with longer strings, but by turning it on its side using UNPIVOT its quite straightforward.
declare @Data table (RowID nvarchar(1), StringData nvarchar(10))
insert into @Data values (N'1', N'3333333333'),(N'2', N'1113333333'),(N'3', N'1112222444')
select t1.StringData, cast(t2.CharCount as nvarchar) as 'Unique Characters in String'
from @Data t1
inner join (
select RowID,count(*) as 'CharCount'
from (
select distinct RowID, [char]
from (
select RowID,
substring(StringData,1,1) as '1',
substring(StringData,2,1) as '2',
substring(StringData,3,1) as '3',
substring(StringData,4,1) as '4',
substring(StringData,5,1) as '5',
substring(StringData,6,1) as '6',
substring(StringData,7,1) as '7',
substring(StringData,8,1) as '8',
substring(StringData,9,1) as '9',
substring(StringData,10,1) as '10'
from @Data
) Unpivd
unpivot ( [char] for chars in ([1],[2],[3],[4],[5],[6],[7],[8],[9],[10])) unpiv
where [char] <> ''
) CharCounter
group by RowID
) t2
on t2.RowID = t1.RowID
This returns:
StringData Unique Characters in String
3333333333 1
1113333333 2
1112222444 3
Upvotes: 0
Reputation: 4513
This is for fun right?
SQL is all about processing sets of rows, so if we can convert a 'word' into a set of characters as rows then we can use the 'group' functions to do useful stuff.
Using a 'relational database engine' to do simple character manipulation feels wrong. Still, is it possible to answer your question with just SQL? Yes it is...
Now, i always have a table that has one integer column that has about 500 rows in it that has the ascending sequence 1 .. 500. It is called 'integerseries'. It is a really small table that used a lot so it gets cached in memory. It is designed to replace the from 'select 1 ... union ...
text in queries.
It is useful for generating sequential rows (a table) of anything that you can calculate that is based on a integer by using it in a cross join
(also any inner join
). I use it for generating days for a year, parsing comma delimited strings etc.
Now, the sql mid
function can be used to return the character at a given position. By using the 'integerseries' table i can 'easily' convert a 'word' into a characters table with one row per character. Then use the 'group' functions...
SET @word='Hello World';
SELECT charAtIdx, COUNT(charAtIdx)
FROM (SELECT charIdx.id,
MID(@word, charIdx.id, 1) AS charAtIdx
FROM integerseries AS charIdx
WHERE charIdx.id <= LENGTH(@word)
ORDER BY charIdx.id ASC
) wordLetters
GROUP BY
wordLetters.charAtIdx
ORDER BY charAtIdx ASC
Output:
charAtIdx count(charAtIdx)
--------- ------------------
1
d 1
e 1
H 1
l 3
o 2
r 1
W 1
Note: The number of rows in the output is the number of different characters in the string. So, if the number of output rows is counted then the number of 'different letters' will be known.
This observation is used in the final query.
The final query:
The interesting point here is to move the 'integerseries' 'cross join' restrictions (1 .. length(word)) into the actual 'join' rather than do it in the where
clause. This provides the optimizer with clues as to how to restrict the data produced when doing the join
.
SELECT
wordLetterCounts.wordId,
wordLetterCounts.word,
COUNT(wordLetterCounts.wordId) AS letterCount
FROM
(SELECT words.id AS wordId,
words.word AS word,
iseq.id AS charPos,
MID(words.word, iseq.id, 1) AS charAtPos,
COUNT(MID(words.word, iseq.id, 1)) AS charAtPosCount
FROM
words
JOIN integerseries AS iseq
ON iseq.id BETWEEN 1 AND words.wordlen
GROUP BY
words.id,
MID(words.word, iseq.id, 1)
) AS wordLetterCounts
GROUP BY
wordLetterCounts.wordId
Output:
wordId word letterCount
------ -------------------- -------------
1 3333333333 1
2 1113333333 2
3 1112222444 3
4 Hello World 8
5 funny - not so much? 13
Word Table and Data:
CREATE TABLE `words` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`word` varchar(128) COLLATE utf8mb4_unicode_ci NOT NULL,
`wordlen` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
/*Data for the table `words` */
insert into `words`(`id`,`word`,`wordlen`) values (1,'3333333333',10);
insert into `words`(`id`,`word`,`wordlen`) values (2,'1113333333',10);
insert into `words`(`id`,`word`,`wordlen`) values (3,'1112222444',10);
insert into `words`(`id`,`word`,`wordlen`) values (4,'Hello World',11);
insert into `words`(`id`,`word`,`wordlen`) values (5,'funny - not so much?',20);
Integerseries table: range 1 .. 30 for this example.
CREATE TABLE `integerseries` (
`id` int(11) unsigned NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=500 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Upvotes: 8
Reputation: 290
DROP FUNCTION IF EXISTS test.count_chrs;
CREATE DEFINER=`test`@`localhost` FUNCTION `count_chrs`(s CHAR(100)) RETURNS CHAR(4)
BEGIN
DECLARE string_length int(4);
DECLARE unique_string CHAR(100) DEFAULT "";
DECLARE count_unique int(4) DEFAULT 0;
DECLARE current_char int(4) DEFAULT 1;
SET string_length = CHAR_LENGTH(s);
WHILE current_char <= string_length DO
IF (!LOCATE(SUBSTR(s, current_char, 1), unique_string)) THEN
SET count_unique = count_unique + 1;
SET unique_string = CONCAT(unique_string, SUBSTR(s, current_char, 1));
END IF;
SET current_char = current_char + 1;
END WHILE;
RETURN count_unique;
END;
I'm very newbie with MySQL functions declaration, but this could turn you to the right way.
Upvotes: 0
Reputation: 9010
One thing you can do is have a table of all your characters, such as:
mysql> select * from chars;
+----+------+
| id | c |
+----+------+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
+----+------+
If your table of words looks like this:
mysql> select * from words;
+----+-----------+
| id | word |
+----+-----------+
| 1 | 111222333 |
| 2 | 11111111 |
| 3 | 2222111 |
| 4 | 5555555 |
+----+-----------+
You can then join these tables on the condition of the character being in the word, and get the count, like this:
mysql> select word, count(c) from words w inner join chars c on locate(c.c, word) group by word;
+-----------+----------+
| word | count(c) |
+-----------+----------+
| 11111111 | 1 |
| 111222333 | 3 |
| 2222111 | 2 |
+-----------+----------+
Upvotes: 3
Reputation: 44874
There is no direct or easy way of doing it. You may need to write a store function to do the job and by looking at all the characters you may expect in the data. Here is an example for just digits , which could be extended for all the characters in a stored function
mysql> select * from test ;
+------------+
| val |
+------------+
| 11111111 |
| 111222222 |
| 1113333222 |
+------------+
select
val,
sum(case when locate('1',val) > 0 then 1 else 0 end )
+ sum( case when locate('2',val) > 0 then 1 else 0 end)
+ sum(case when locate('3',val) > 0 then 1 else 0 end)
+sum(case when locate('4',val) > 0 then 1 else 0 end ) as occurence
from test group by val
+------------+-----------+
| val | occurence |
+------------+-----------+
| 11111111 | 1 |
| 111222222 | 2 |
| 1113333222 | 3 |
+------------+-----------+
Or if you have enough time , create a lookup table with all the characters you could think of. And make the query in 2 lines
mysql> select * from test ;
+------------+
| val |
+------------+
| 11111111 |
| 111222222 |
| 1113333222 |
+------------+
3 rows in set (0.00 sec)
mysql> select * from look_up ;
+------+------+
| id | val |
+------+------+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
+------+------+
4 rows in set (0.00 sec)
select
t1.val,
sum(case when locate(t2.val,t1.val) > 0 then 1 else 0 end ) as occ
from test t1,(select * from look_up)t2
group by t1.val ;
+------------+------+
| val | occ |
+------------+------+
| 11111111 | 1 |
| 111222222 | 2 |
| 1113333222 | 3 |
+------------+------+
Upvotes: 4
Reputation: 429
I think this is not a job for Mysql, but you could do anything if you try hard enough ;)
I don't like this answers but it works, and it's not too ugly if you have only numbers
SELECT
CASE WHEN yourcolumn LIKE '%1%' THEN 1 ELSE 0 END +
CASE WHEN yourcolumn LIKE '%2%' THEN 1 ELSE 0 END +
CASE WHEN yourcolumn LIKE '%3%' THEN 1 ELSE 0 END +
CASE WHEN yourcolumn LIKE '%4%' THEN 1 ELSE 0 END +
CASE WHEN yourcolumn LIKE '%5%' THEN 1 ELSE 0 END +
CASE WHEN yourcolumn LIKE '%6%' THEN 1 ELSE 0 END +
CASE WHEN yourcolumn LIKE '%7%' THEN 1 ELSE 0 END +
CASE WHEN yourcolumn LIKE '%8%' THEN 1 ELSE 0 END +
CASE WHEN yourcolumn LIKE '%9%' THEN 1 ELSE 0 END +
CASE WHEN yourcolumn LIKE '%0%' THEN 1 ELSE 0 END
FROM yourtable
Upvotes: 0