Reputation: 361
I want to perform a group-by on the distinct values in a string column that holds multiple values.
The column contains a list of strings in a standard format, separated by commas. The only possible values are a, b, c, and d.
For example, the column collection (type: String) contains:
Row 1: ["a","b"]
Row 2: ["b","c"]
Row 3: ["b","c","a"]
Row 4: ["d"]
The expected output is a count of unique values:
collection | count
a | 2
b | 3
c | 2
d | 1
Upvotes: 1
Views: 708
Reputation: 5124
What you need to do is first explode the collection column into separate rows (like a flatMap operation). In Redshift, the only way to generate new rows is a JOIN, so let's CROSS JOIN your input table with a static table of consecutive numbers and keep only the rows whose number i is less than or equal to the number of elements in the collection. Then we use the split_part function to read the item at that index. Once we have the exploded table, a simple GROUP BY does the rest.
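If your items are stored as JSON array strings ('["a", "b", "c"]'), you can use JSON_ARRAY_LENGTH and JSON_EXTRACT_ARRAY_ELEMENT_TEXT instead of REGEXP_COUNT and SPLIT_PART, respectively. Note that JSON_EXTRACT_ARRAY_ELEMENT_TEXT takes a zero-based index while SPLIT_PART is one-based, so pass i - 1 in that case.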
with
index as (
    select 1 as i
    union all select 2
    union all select 3
    union all select 4 -- could be substituted with 'select row_number() over () as i from arbitrary_table limit 4'
),
agg as (
    select 'a,b' as collection
    union all select 'b,c'
    union all select 'b,c,a'
    union all select 'd'
)
select
    split_part(collection, ',', i) as item,
    count(*)
from index, agg
where regexp_count(agg.collection, ',') + 1 >= index.i -- keep only indexes that do not exceed the number of items
group by 1
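To make the cross join easier to follow, here is a minimal Python sketch of the same logic on the sample data. It is only an illustration, not Redshift code: the split_part helper below is a hypothetical stand-in for the SQL function, and the variable names are made up for this example.

```python
from collections import Counter

# Same sample data as the `agg` CTE.
collections = ["a,b", "b,c", "b,c,a", "d"]
# Same role as the `index` CTE: 1..4, the longest list length expected.
indexes = [1, 2, 3, 4]


def split_part(text: str, delimiter: str, position: int) -> str:
    """Hypothetical stand-in for SQL split_part: 1-based, empty string when out of range."""
    parts = text.split(delimiter)
    return parts[position - 1] if 1 <= position <= len(parts) else ""


# Cross join every index with every row, then keep only the pairs where
# the index does not exceed the number of items in that row.
exploded = [
    split_part(collection, ",", i)
    for i in indexes
    for collection in collections
    if collection.count(",") + 1 >= i
]

# The GROUP BY step: count how often each exploded item occurs.
counts = Counter(exploded)
for item in sorted(counts):
    print(item, counts[item])
```

Every (index, row) pair that survives the filter becomes one exploded row, so the eight items are counted the same way the GROUP BY counts them.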
Upvotes: 1
Reputation: 829
For all of the queries below I used this table:
create table tmp (
    id INT auto_increment,
    test VARCHAR(255),
    PRIMARY KEY (id)
);
insert into tmp (test) values
    ("a,b"),
    ("b,c"),
    ("b,c,a"),
    ("d")
;
If the possible values are only a, b, c, and d, you can try one of the following.
Note that this only works if no value is a substring of another, such as test and test_new: test would also be joined with every row containing test_new, and the counts would be wrong.
select collection, COUNT(*) as count from tmp JOIN (
    select CONCAT("%", tb.collection, "%") as like_collection, collection from (
        select "a" COLLATE utf8_general_ci as collection
        union select "b" COLLATE utf8_general_ci as collection
        union select "c" COLLATE utf8_general_ci as collection
        union select "d" COLLATE utf8_general_ci as collection
    ) tb
) tb1
ON tmp.test LIKE tb1.like_collection
GROUP BY tb1.collection;
This gives the result you want:
collection | count
a | 2
b | 3
c | 2
d | 1
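The substring caveat can be checked with a small experiment. This is a sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL, so || replaces CONCAT; the vals table and the sample rows are made up for the illustration. It also shows one possible workaround: wrap both the column and the pattern in commas so that only whole items match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical sample rows in which one value ("test") is a prefix of another ("test_new").
conn.execute("CREATE TABLE tmp (id INTEGER PRIMARY KEY, test TEXT)")
conn.executemany(
    "INSERT INTO tmp (test) VALUES (?)",
    [("test,other",), ("test_new",), ("test_new,other",)],
)

# Hypothetical lookup table holding the known values.
conn.execute("CREATE TABLE vals (collection TEXT)")
conn.executemany(
    "INSERT INTO vals (collection) VALUES (?)",
    [("test",), ("test_new",), ("other",)],
)

# Plain substring match: "test" also matches the rows that contain "test_new".
naive = dict(conn.execute(
    "SELECT v.collection, COUNT(*) FROM tmp "
    "JOIN vals v ON tmp.test LIKE ('%' || v.collection || '%') "
    "GROUP BY v.collection"
).fetchall())

# Delimiter-wrapped match: ',test,' can only match a whole list item.
exact = dict(conn.execute(
    "SELECT v.collection, COUNT(*) FROM tmp "
    "JOIN vals v ON (',' || tmp.test || ',') LIKE ('%,' || v.collection || ',%') "
    "GROUP BY v.collection"
).fetchall())

print("substring match:", naive)
print("whole-item match:", exact)
conn.close()
```

With the plain substring match, test is counted in all three rows even though it appears as an item in only one. In MySQL the same idea would be written with CONCAT(',', test, ',') on the left-hand side, and FIND_IN_SET is another option for comma-separated lists. Keep in mind that _ and % are LIKE wildcards, so values containing them may need escaping.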
Or you can try this one, which returns one column per value:
SELECT
    (SELECT COUNT(*) FROM tmp WHERE test LIKE '%a%') as a_count,
    (SELECT COUNT(*) FROM tmp WHERE test LIKE '%b%') as b_count,
    (SELECT COUNT(*) FROM tmp WHERE test LIKE '%c%') as c_count,
    (SELECT COUNT(*) FROM tmp WHERE test LIKE '%d%') as d_count
;
The result looks like this:
a_count | b_count | c_count | d_count
2 | 3 | 2 | 1
Upvotes: 1