Reputation: 41
I have a large data set in a Postgres db and need to generate a field that groups rows into a respective bin for "0-100", "101-200", "201-300", etc. all the way up to nearly 5000. I am aware that I could manually update each row and produce a line of code for each bin like this:
update test
set testgroup = '0-100' where testint >= 1 and distance < 100;
I really would like to figure out a more efficient way to do this, open to anything and everything! The main goal is to look at the integer in this 'testint' column and then if it is in between 1-100 return in the testgroup column "0-100".
Upvotes: 1
Views: 6664
Reputation: 2967
all the way up to nearly 5000
Sometimes the issue is that the upper bound is unknown so width_bucket
may not be ideal in such case as the upper bound is required.
But an old school modulo may be enough:
-- Explicit
SELECT testint - testint % 100 || '-' || testint - testint % 100 + 100
FROM (
VALUES
(256),
(543),
(33),
(5611)
) AS q (testint)
-- Less duplicate operations
SELECT left_end || '-' || left_end + 100
FROM (
VALUES
(256),
(543),
(33),
(5611)
) AS q (testint)
JOIN LATERAL (SELECT testint - testint % 100) l(left_end) ON TRUE;
Both return:
200-300
500-600
0-100
5600-5700
Upvotes: 0
Reputation: 5805
Use the width_bucket
function. See the the docs, but here is a short version of the syntax:
width_bucket(a, LBound, UBound, num_bins)
To get it to work properly for your bins, I have to add 1 to UBound. Some examples:
select width_bucket( 1, 0, 5001, 50)
gives 1
select width_bucket(100, 0, 5001, 0)
gives 1
select width_bucket(101, 0, 5001, 50)
gives 2
select width_bucket(4900, 0, 5001, 50)
gives 49
select width_bucket(4901, 0, 5001, 50)
gives 50
So that works as expected. Next we need to generate the proper string. Pseudo format is
(width_bucket - 1)*100 || '-' || (width_bucket)*100
Where || is the SQL concatenation operator. Using the first example from before:
select (width_bucket(1, 0, 5001, 50)-1)*100 || ' - ' || width_bucket(1, 0, 5001, 50)*100
gives '0 - 100'
Sweet. Now putting it all together. First make a sandbox table you can use for testing. This will be a copy or partial copy of your data:
CREATE TABLE test
AS
SELECT *
FROM original_table
Then add the new column to the table:
ALTER TABLE test
ADD COLUMN testgroup text
Now the UPDATE statement:
UPDATE test
SET testgroup = width_bucket(testint, 0, 5001, 50)-1)*100 || ' - ' ||
width_bucket(testint, 0, 5001, 50)*100
Upvotes: 4
Reputation: 17846
You can make use of generate_series
to generate numbers from 0 to 50, and then to select the data between the generated values * 100 and the next generated value * 100. The same principle is used to build the bin name.
UPDATE test
SET testgroup = (x*100)+1 || '-' || (x+1)*100
FROM generate_series(0,50) f(x)
WHERE testint > (x*100)
AND testint <= ((x+1)*100);
http://rextester.com/FXIS37706
Upvotes: 0