Reputation: 4625

Mysql create freqency distribution

I have a simple table BIRDCOUNT below, showing how many birds were counted on any given day:

+----------+
| NUMBIRDS |
+----------+
| 123      |
| 573      |
| 3        |
| 234      |
+----------+

I would like to create a frequency distribution graph, showing how many times a number of birds were counted. So I need MySQL to create something like:

+------------+-------------+
| BIRD_COUNT | TIMES_SEEN  |
+------------+-------------+
| 0-99       | 17          |
| 100-299    | 23          |
| 200-399    | 12          |
| 300-499    | 122         |
| 400-599    | 3           |
+------------+-------------+

If the bird count ranges were fixed this would be easy. However, I never know the min/max of how many birds were seen. So I need a select statement that:

Creates an output similar to above, always creating 10 ranges of counts.
(more advanced) Creates output similar to above, always creating N ranges of counts.

I don't know if #2 is possible in a single select but can anyone solve #1?

Upvotes: 10

Answers (4)

Jean Marois

Reputation: 1600

Building upon @gustek answer and Wikipedia's Histogram page, here are a couple of solutions using Scott's Rule and Rice Rule to dynamically set the bin width h using the formula for number of bins, $k = \ceil{ (max - min) / h }$

# Histogram generator using Scott's rule, width(h) = (max - min) / k
SELECT any_value(FLOOR(r2.value / stat.width) * stat.width) as range_start,
       count(r2.value)                                      as times_seen,
FROM RESULT r2,
 (
     select 3.49 * stddev(r.value) / (power(count(*), 1 / 3)) as width
     from RESULT r
 ) as stat
GROUP BY FLOOR(r2.value / stat.width);

# Histogram using Rice rule k = ceil(2*n^1/3), width(h) = (max - min) / k
SELECT any_value(FLOOR(r2.value / stat.width) * stat.width) as range_start,
       count(r2.value)                                      as times_seen,
FROM RESULT r2,
 (
     select (max(r.value) - min(r.value)) / ceil(2 * power(count(*), 1 / 3)) as width
     from RESULT r
 ) as stat
GROUP BY FLOOR(r2.value / stat.width);

The any_value() function is used to work around new MySQL ONLY_FULL_GROUP_BY issue.

Upvotes: 1

Gustek

Reputation: 3760

SELECT
    FLOOR( birds.bird_count / stat.diff ) * stat.diff as range_start, 
    (FLOOR( birds.bird_count / stat.diff ) +1) * stat.diff -1 as range_end, 
    count( birds.bird_count ) as times_seen
FROM birds_table birds, 
    (SELECT 
        ROUND((MAX( bird_count ) - MIN( bird_count ))/10) AS diff
    FROM birds_table
    ) AS stat
GROUP BY FLOOR( birds.bird_count / stat.diff )

Here You have answer for both of Your questions ;] with difference that start and end of range are in separate columns instead of concatenated but if You need it in one column I guess You can do it from here. To change number of ranges just edit number 10 You can find in sub-query.

Upvotes: 8

Mats Kindahl

Reputation: 2075

When creating something like this, GROUP BY, is your friend. The basic idea is to put each value into a bucket, and then count the number of elements in each bucket. To create a bucket, you define a function that takes the value and compute a unique value for the bucket.

Something like this:

SELECT
  @low := TRUNCATE(bird_count/100, 0) * 100 as Low,
  TRUNCATE(@low + 99, 0) as High,
  COUNT(*) AS Count
FROM birds_seen
GROUP BY Low;

In this case, you define a function that take the bird count, and compute the lower range of the bucket. You then group all the values on the lower range, which will place, for example, 123 and 145 into the bucket labelled "100", and 234 and 246 into the bucket labelled "200".

Now, each value is placed in a bucket, and you can group the values by the bucket label, and count the number of elements in each bucket.

Upvotes: 3

fenway

Reputation: 446

I'm guess at your actual SQL query:

SELECT dateColumn, COUNT(*) AS NUMBIRDS
FROM birdTable
GROUP BY dateColumn

If so, all you have to do is "bin" your counts:

SELECT CONCAT_WS('-', 
   FLOOR( NUMBIRDS/100 )*100,
   ((FLOOR( NUMBIRDS/100 )+1)*100) - 1
) AS BIRD_COUNT
,COUNT(*) AS TIMES_SEEN
FROM (
    SELECT dateColumn, COUNT(*) AS NUMBIRDS
    FROM birdTable
    GROUP BY dateColumn
) AS birdCounts
GROUP BY BIRD_COUNT

Granted, if one of the ranges is missing, you'll won't get a matching row -- but you can easily solve that with a LEFT JOIN if that's an issue.

Upvotes: 1

Mysql create freqency distribution

Answers (4)

Related Questions