Reputation: 4625
I have a simple table BIRDCOUNT below, showing how many birds were counted on any given day:
+----------+
| NUMBIRDS |
+----------+
| 123 |
| 573 |
| 3 |
| 234 |
+----------+
I would like to create a frequency distribution graph, showing how many times a number of birds were counted. So I need MySQL to create something like:
+------------+-------------+
| BIRD_COUNT | TIMES_SEEN |
+------------+-------------+
| 0-99 | 17 |
| 100-299 | 23 |
| 200-399 | 12 |
| 300-499 | 122 |
| 400-599 | 3 |
+------------+-------------+
If the bird count ranges were fixed this would be easy. However, I never know the min/max of how many birds were seen. So I need a select statement that:
I don't know if #2 is possible in a single select but can anyone solve #1?
Upvotes: 10
Views: 7714
Reputation: 1600
Building upon @gustek answer and Wikipedia's Histogram page, here are a couple of solutions using Scott's Rule and Rice Rule to dynamically set the bin width h using the formula for number of bins,
# Histogram generator using Scott's rule, width(h) = (max - min) / k
SELECT any_value(FLOOR(r2.value / stat.width) * stat.width) as range_start,
count(r2.value) as times_seen,
FROM RESULT r2,
(
select 3.49 * stddev(r.value) / (power(count(*), 1 / 3)) as width
from RESULT r
) as stat
GROUP BY FLOOR(r2.value / stat.width);
# Histogram using Rice rule k = ceil(2*n^1/3), width(h) = (max - min) / k
SELECT any_value(FLOOR(r2.value / stat.width) * stat.width) as range_start,
count(r2.value) as times_seen,
FROM RESULT r2,
(
select (max(r.value) - min(r.value)) / ceil(2 * power(count(*), 1 / 3)) as width
from RESULT r
) as stat
GROUP BY FLOOR(r2.value / stat.width);
The any_value()
function is used to work around new MySQL ONLY_FULL_GROUP_BY
issue.
Upvotes: 1
Reputation: 3760
SELECT
FLOOR( birds.bird_count / stat.diff ) * stat.diff as range_start,
(FLOOR( birds.bird_count / stat.diff ) +1) * stat.diff -1 as range_end,
count( birds.bird_count ) as times_seen
FROM birds_table birds,
(SELECT
ROUND((MAX( bird_count ) - MIN( bird_count ))/10) AS diff
FROM birds_table
) AS stat
GROUP BY FLOOR( birds.bird_count / stat.diff )
Here You have answer for both of Your questions ;] with difference that start and end of range are in separate columns instead of concatenated but if You need it in one column I guess You can do it from here. To change number of ranges just edit number 10 You can find in sub-query.
Upvotes: 8
Reputation: 2075
When creating something like this, GROUP BY, is your friend. The basic idea is to put each value into a bucket, and then count the number of elements in each bucket. To create a bucket, you define a function that takes the value and compute a unique value for the bucket.
Something like this:
SELECT
@low := TRUNCATE(bird_count/100, 0) * 100 as Low,
TRUNCATE(@low + 99, 0) as High,
COUNT(*) AS Count
FROM birds_seen
GROUP BY Low;
In this case, you define a function that take the bird count, and compute the lower range of the bucket. You then group all the values on the lower range, which will place, for example, 123 and 145 into the bucket labelled "100", and 234 and 246 into the bucket labelled "200".
Now, each value is placed in a bucket, and you can group the values by the bucket label, and count the number of elements in each bucket.
Upvotes: 3
Reputation: 446
I'm guess at your actual SQL query:
SELECT dateColumn, COUNT(*) AS NUMBIRDS
FROM birdTable
GROUP BY dateColumn
If so, all you have to do is "bin" your counts:
SELECT CONCAT_WS('-',
FLOOR( NUMBIRDS/100 )*100,
((FLOOR( NUMBIRDS/100 )+1)*100) - 1
) AS BIRD_COUNT
,COUNT(*) AS TIMES_SEEN
FROM (
SELECT dateColumn, COUNT(*) AS NUMBIRDS
FROM birdTable
GROUP BY dateColumn
) AS birdCounts
GROUP BY BIRD_COUNT
Granted, if one of the ranges is missing, you'll won't get a matching row -- but you can easily solve that with a LEFT JOIN if that's an issue.
Upvotes: 1