Reputation: 39

PostgreSQL select most common combination of two columns

I need help with manipulating a table in PostgreSQL.

I have a table with three columns and more than 30k rows. Many of the values repeat themselves in the lines.

I must write a SELECT statement that outputs the most frequent values of the first column, the most frequent values of the second column and the most common combination of both columns. The whole thing must be grouped by the third column.

What I've tried:

SELECT * 
FROM 
    (SELECT column1 AS " Most frequent1", COUNT(column1) AS "occurrence"
     FROM table_name
     GROUP BY column1
     ORDER BY occurrence DESC
     LIMIT 1),
    (SELECT column2 AS "Most frequent2", COUNT(column2) AS "occurrence"
     FROM table_name
     GROUP BY column2
     ORDER BY occurrence DESC
     LIMIT 1),
    (SELECT CONCAT(column1, column2) AS "kombiniert", COUNT(CONCAT(column1, column2)) AS "occurrence"
     FROM table_name
     GROUP BY kombiniert
     ORDER BY occurrence DESC
     LIMIT 1);

How can I group the whole thing by column3?

Is there a better method?

Upvotes: 0

Answers (3)

Felipe Funes

Reputation: 81

Actually, if you want to get the mode of two columns you can use the sum operator. Something like this:

MODE() WITHIN GROUP (ORDER BY (col1 + col2)) as mode_col_1_and_2

EDITED: After reading the comments and a lot of testing, I regret of this answer. @a_horse_with_no_name came with the right way:

MODE() WITHIN GROUP (ORDER BY (col1, col2)) as mode_col_1_and_2

Upvotes: 0

Thorsten Kettner

Reputation: 94874

What you are looking for is MODE, the statistical mode = the most frequent value.

SELECT
  column3,
  MODE() WITHIN GROUP (ORDER BY column1) AS most_frequent_column1,
  MODE() WITHIN GROUP (ORDER BY column2) AS most_frequent_column2,
  MODE() WITHIN GROUP (ORDER BY column1 || column2) AS most_frequent_pair
FROM table_name
GROUP BY column3
ORDER BY column3;

Two points:

MODE ignores nulls. If this is undesired, you must workaround this.
Just glueing the two strings together may or may not be what you are looking for. 'AB' || CD' = ' ABCD'. 'A' || BCD' = ' ABCD'.

For the latter, please see a_horse_with_no_name's comments below. MODE() WITHIN GROUP (ORDER BY (column1, column2)) perfectly solves the ambiguous pairs issue.

Upvotes: 0

Gordon Linoff

Reputation: 1269523

The most common value is called the mode in statistics. This should return the three modes that you are looking for:

select col3,
       max(col1_cnt), max(case when seqnum_1 then col1 end),
       max(col2_cnt), max(case when seqnum_2 then col2 end),
       max(col12_cnt), max(case when seqnum_12 then col1 || ':' || col2 end)
from (select t.*,
             row_number() over (partition by col3 order by col1_cnt desc) as seqnum_1,
             row_number() over (partition by col3 order by col2_cnt desc) as seqnum_2,
             row_number() over (partition by col3 order by col12_cnt desc) as seqnum_12
      from (select col1, col2, col3, count(*) as cnt,
                   sum(count(*)) over (partition by col1) as col1_cnt,
                   sum(count(*)) over (partition by col2) as col2_cnt,
                   sum(count(*)) over (partition by col1, col2) as col1_col2_cnt
            from t
            group by col1, col2, col3
           ) t
     ) t
group by col3;

Upvotes: 2

PostgreSQL select most common combination of two columns

Answers (3)

Related Questions