Thorsten79
Thorsten79

Reputation: 10128

Generating a histogram from column values in a database

Let's say I have a database column 'grade' like this:

|grade|
|    1|
|    2|
|    1|
|    3|
|    4|
|    5|

Is there a non-trivial way in SQL to generate a histogram like this?

|2,1,1,1,1,0|

where 2 means the grade 1 occurs twice, the 1s mean grades {2..5} occur once and 0 means grade 6 does not occur at all.

I don't mind if the histogram is one row per count.

If that matters, the database is SQL Server accessed by a perl CGI through unixODBC/FreeTDS.

EDIT: Thanks for your quick replies! It is okay if non-existing values (like grade 6 in the example above) do not occur as long as I can make out which histogram value belongs to which grade.

Upvotes: 36

Views: 72527

Answers (9)

Group by ranges and consider empty ranges

This is an extension to: https://stackoverflow.com/a/41275222/895245 which also creates bins for the empty ranges with 0 entries in them:

select x, sum(cnt) from (
  select floor(x/5)*5 as x,
         count(*) as cnt
    from t
    group by 1
  union
  select *, 0 as cnt from generate_series(0, 15, 5)
)
group by x

Tested with:

f="tmp.sqlite"
rm -f "$f"
sqlite3 "$f" 'create table t(x integer)'
sqlite3 "$f" <<'EOF'
insert into t values
  (0),
  (2),
  (2),
  (3),

  (5),
  (6),
  (6),
  (8),
  (9),

  (17)
EOF
sqlite3 "$f" <<EOF
select x, sum(cnt) from (
  select floor(x/5)*5 as x,
         count(*) as cnt
    from t
    group by 1
  union
  select *, 0 as cnt from generate_series(0, 15, 5)
)
group by x
EOF

Output:

0|4
5|5
10|0
15|1

so we see that the bin that starts at value 10 gets an explicit 0 value as desired.

The trick is to create a range of zeroes with generate_series, and then sum it up with the populated ranges. It doesn't alter the counts for populated ranges, but produces the 0 entry.

Multiple range queries are faster on-the-fly floor() calculation

Although using floor() is convenient and self contained, it is likely going to be far slower on a large-ish database if the column is indexed. The reason is likely because in that case SQLite loses the ability to just use the index to count how many values are there in each range, which is a fast operation with an index.

For example, I create a test database with 10m rows:

f="10m.sqlite"
rm -f "$f"
sqlite3 "$f" 'create table t(x integer)'
time sqlite3 "$f" 'insert into t select value as x from generate_series(0,9999999)'
time sqlite3 "$f" 'create index tx on t(x)'

Then, multiple queries with bins of size 1m:

bin=1000000
time (
i=0
while [ $i -lt 10 ]; do
  start="$((i * $bin))"
  printf "${start}|"
  sqlite3 10m.sqlite "select count(*) from t where x >= $start and x < $(((i + 1) * $bin))"
  i=$((i + 1))
done
)

finishes in 0.28s with output:

0|1000000
1000000|1000000
2000000|1000000
3000000|1000000
4000000|1000000
5000000|1000000
6000000|1000000
7000000|1000000
8000000|1000000
9000000|1000000

Doing floor however:

time sqlite3 10m.sqlite <<EOF
select floor(x/1000000)*1000000 as x,
       count(*) as cnt
from t
group by 1
order by 1
EOF

takes 1.7 seconds!

Calculated column overkill

I'm not sure if there's any concrete advantage of using this over just doing multiple queries. But it does allow you to get the result efficiently on a single query.

We will create a calculated column with the floor, index it, and just use that on the GROUP BY:

f="10m.sqlite"
rm -f "$f"
sqlite3 "$f" <<EOF
create table t(
  x integer,
  x_floor integer generated always as (floor(x/1000000)*1000000) STORED
)
EOF
time sqlite3 "$f" 'insert into t select value as x from generate_series(0,9999999)'
time sqlite3 "$f" 'create index tx_floor on t(x_floor)'
time sqlite3 10m.sqlite <<EOF
select x_floor, count(*) as cnt
from t
group by x_floor
order by x_floor
EOF

Runtime: 0.48s. Hmmm, so it was slower than just doing multiple queries. Weird. One theory is that this is because we've made the database 2x as large with the new stored column, and that's creating an IO cost.

Tested on Ubuntu 24.10, SQLite 3.46.1, Lenovo ThinkPad P14s.

Upvotes: 1

Umang Bhardwaj
Umang Bhardwaj

Reputation: 1

SELECT FLOOR(grade/5.00)*5 As Grade_Lower, 
FLOOR(grade/5.00)*5+5 As Grade_Upper
       COUNT(*) AS [Grade Count]
FROM TableName
GROUP BY FLOOR(Grade/5.00)*5, FLOOR(grade/5.00)*5+5
ORDER BY 1

Video tutorial if you like

https://www.youtube.com/watch?v=ioc-NU4meu8

Upvotes: 0

Devon
Devon

Reputation: 1093

According to Shlomo Priymak's article How to Quickly Create a Histogram in MySQL, you can use the following query:

SELECT grade, 
       COUNT(*) AS 'Count',
       RPAD('', COUNT(*), '*') AS 'Bar' 
FROM grades 
GROUP BY grade

Which will produce the following table:

grade   Count   Bar
1       2       **
2       1       *
3       1       *
4       1       *
5       1       *

Upvotes: 5

KyleMit
KyleMit

Reputation: 30097

If there are a lot of data points, you can also group ranges together like this:

SELECT FLOOR(grade/5.00)*5 As Grade, 
       COUNT(*) AS [Grade Count]
FROM TableName
GROUP BY FLOOR(Grade/5.00)*5
ORDER BY 1

Additionally, if you wanted to label the full range, you can get the floor and ceiling ahead of time with a CTE.

With GradeRanges As (
  SELECT FLOOR(Score/5.00)*5     As GradeFloor, 
         FLOOR(Score/5.00)*5 + 4 As GradeCeiling
  FROM TableName
)
SELECT GradeFloor,
       CONCAT(GradeFloor, ' to ', GradeCeiling) AS GradeRange,
       COUNT(*) AS [Grade Count]
FROM GradeRanges
GROUP BY GradeFloor, CONCAT(GradeFloor, ' to ', GradeCeiling)
ORDER BY GradeFloor

Note: In some SQL engines, you can GROUP BY an Ordinal Column Index, but with MS SQL, if you want it in the SELECT statement, you're going to need to group by it also, hence copying the Range into the Group Expression as well.

Option 2: You could use case statements to selectively count values into arbitrary bins and then unpivot them to get a row by row count of included values

Upvotes: 24

Sylvain Ayrault
Sylvain Ayrault

Reputation: 11

I am building on what Ilya Volodin did above, that should allow you to select a range of grade you want to group together in your result:

DECLARE @cnt INT = 0;

WHILE @cnt < 100 -- Set max value
BEGIN
SELECT @cnt,COUNT(fe) FROM dbo.GEODATA_CB where fe >= @cnt-0.999 and fe <= @cnt+0.999 -- set tolerance
SET @cnt = @cnt + 1; -- set step
END;

Upvotes: 0

cjk
cjk

Reputation: 46465

Use a temp table to get your missing values:

CREATE TABLE #tmp(num int)
DECLARE @num int
SET @num = 0
WHILE @num < 10
BEGIN
  INSERT #tmp @num
  SET @num = @num + 1
END


SELECT t.num as [Grade], count(g.Grade) FROM gradeTable g
RIGHT JOIN #tmp t on g.Grade = t.num
GROUP by t.num
ORDER BY 1

Upvotes: 7

MatBailie
MatBailie

Reputation: 86765

Gamecat's use of DISTINCT seems a little odd to me, will have to try it out when I'm back in the office...

The way I would do it is similar though...

SELECT
    [table].grade        AS [grade],
    COUNT(*)             AS [occurances]
FROM
    [table]
GROUP BY
    [table].grade
ORDER BY
    [table].grade

To overcome the lack of data where there are 0 occurances, you can LEFT JOIN on to a table containing all valid grades. The COUNT(*) will count NULLS, but COUNT(grade) won't count the NULLS.

DECLARE @grades TABLE (
   val INT
   )  

INSERT INTO @grades VALUES (1)  
INSERT INTO @grades VALUES (2)  
INSERT INTO @grades VALUES (3)  
INSERT INTO @grades VALUES (4)  
INSERT INTO @grades VALUES (5)  
INSERT INTO @grades VALUES (6)  

SELECT
    [grades].val         AS [grade],
    COUNT([table].grade) AS [occurances]
FROM
    @grades   AS [grades]
LEFT JOIN
    [table]
        ON [table].grade = [grades].val
GROUP BY
    [grades].val
ORDER BY
    [grades].val

Upvotes: 4

Seibar
Seibar

Reputation: 70273

select Grade, count(Grade)
from MyTable
group by Grade

Upvotes: 3

Ilya Volodin
Ilya Volodin

Reputation: 11256

SELECT COUNT(grade) FROM table GROUP BY grade ORDER BY grade

Haven't verified it, but it should work.It will not, however, show count for 6s grade, since it's not present in the table at all...

Upvotes: 43

Related Questions