Dónal
Dónal

Reputation: 187379

get value distribution via SQL

In a MySQL database there's a table with a single numeric value column. I want to plot the distribution of these values as a bar chart. The width of each bar in the chart should be uniform and the height of each bar should reflect the number of entries in this range.

For example, if the range of values is from -10..90 and the number of ranges (bars) is 10, then the first bar in the chart should show the number of entries in the range -10..0 and the last should show the number of entries in the range 80..90.

Is there a better way to calculate the bar chart data than the following:

  1. Get min and max with select min(value), max(value) from mytable
  2. Calculate the upper and lower bounds of each range (in application code)
  3. Get the number of values in each range with select count(*) from mytable where value between X and Y

A problem with this simplistic approach is that a separate query will need to be issued to find the number of values in each range, is it possible to make this more efficient, e.g. by building up the query dynamically and using a GROUP BY?

Upvotes: 1

Views: 2992

Answers (3)

Strawberry
Strawberry

Reputation: 33945

SELECT * FROM my_table;
+----+-----+
| id | val |
+----+-----+
|  1 |  19 |
|  2 |  10 |
|  3 |   6 |
|  4 |  29 |
|  6 |  27 |
|  7 |  20 |
|  8 |  11 |
|  9 |  12 |
| 13 |  16 |
| 14 |  38 |
| 15 |   8 |
| 16 |  22 |
| 17 |  23 |
| 18 |  16 |
| 19 |  20 |
| 20 |  18 |
| 28 |  18 |
| 29 |   7 |
| 30 |  10 |
| 31 |  34 |
| 32 |  11 |
| 33 |  17 |
| 34 |  15 |
| 35 |  12 |
| 36 |  19 |
| 37 |  15 |
| 38 |  18 |
| 39 |  24 |
| 40 |  10 |
| 41 |  19 |
| 42 |  25 |
| 43 |  22 |
| 59 |  27 |
| 60 |  14 |
| 61 |  17 |
| 62 |  25 |
| 63 |  28 |
| 64 |  20 |
| 65 |  10 |
| 66 |  18 |
| 67 |  22 |
| 68 |  12 |
| 69 |  18 |
| 70 |  31 |
| 71 |  22 |
| 72 |  23 |
| 73 |  20 |
| 74 |  12 |
| 75 |  10 |
| 76 |   9 |
| 77 |  21 |
| 78 |  18 |
| 79 |  14 |
| 80 |  17 |
| 81 |   7 |
| 82 |  21 |
| 83 |  11 |
| 84 |  16 |
| 85 |  19 |
| 86 |  30 |
| 87 |  11 |
| 88 |  18 |
| 89 |  26 |
| 90 |  16 |
+----+-----+

SELECT CEILING(((1+val-(SELECT MIN(val) FROM my_table)))/ROUND((SELECT MAX(val)-MIN(val) FROM my_table)/8)) x
     , COUNT(*)
     , GROUP_CONCAT(LPAD(val,2,0) ORDER BY val) n
  FROM my_table
 GROUP
    BY CEILING(((1+val-(SELECT MIN(val) FROM my_table)))/ROUND((SELECT MAX(val)-MIN(val) FROM my_table)/8)); 
+------+----------+----------------------------------------------------+
| x    | COUNT(*) | n                                                  |
+------+----------+----------------------------------------------------+
|    1 |        5 | 06,07,07,08,09                                     |
|    2 |       13 | 10,10,10,10,10,11,11,11,11,12,12,12,12             |
|    3 |       11 | 14,14,15,15,16,16,16,16,17,17,17                   |
|    4 |       17 | 18,18,18,18,18,18,18,19,19,19,19,20,20,20,20,21,21 |
|    5 |        9 | 22,22,22,22,23,23,24,25,25                         |
|    6 |        5 | 26,27,27,28,29                                     |
|    7 |        2 | 30,31                                              |
|    8 |        1 | 34                                                 |
|    9 |        1 | 38                                                 |
+------+----------+----------------------------------------------------+

Upvotes: 1

Luke
Luke

Reputation: 93

If I understand your question correctly, the ranges are unknown and you must determine each dynamically. Find the min and max values as you have done. Using a cursor, calculate the upper and lower bounds for each range and save the results to a temporary table containing fields such as rangeLabel, lowerBound, upperBound. Select the temp table and look up the counts with a subquery:

SELECT rangeLabel, (SELECT Count(*) FROM mytable 
    WHERE value BETWEEN temp.lowerBound AND temp.Upperbound) as myCount
FROM temp

Upvotes: 0

user3741598
user3741598

Reputation: 295

SELECT CASE
WHEN (max(numb) - min(numb) BETWEEN xx AND yy THEN zz
WHEN (max(numb) - min(numb) BETWEEN zz AND aa THEN cc

will handle your sizing of your bar chart and a GROUP BY in another query will give you the values in each distribution.

Upvotes: 0

Related Questions