Sehael
Sehael

Reputation: 3736

postgresql aggregate by max string length

I have a one to many relationship. In this case, it's a pipelines entity that can have many segments. The segments entity has a column to list the wells associated with this pipeline. This column is purely informational, and is only updated from a regulatory source as a comma separated list, so the data type is text.

What I want to do is to list all the pipelines and show the segment column that has the most associated wells. Each well is identified with a standardized land location (text is the same length for each well). I am also doing other aggregate functions on the segments, so my query looks something like this (I have to simplify it because it's pretty large):

SELECT pipelines.*, max(segments.associated_wells), min(segments.days_without_production), max(segments.production_water_m3)
FROM pipelines
JOIN segments ON segments.pipeline_id = pipelines.id
GROUP BY pipelines.id

This selects the associated_wells that has the highest alphabetical value, which makes sense, but is not what I want.

max(length(segments.associated_wells)) will select the record I want, but only show the length. I need to show the column value.

How can I aggregate based on the string length but show the value?

Here's an example of what I am expecting:

Segment entity:

| id | pipeline_id | associated_wells         | days_without_production | production_water_m3 |
|----|-------------|--------------------------|-------------------------|---------------------|
| 1  | 1           | 'location1', 'location2' | 30                      | 2.3                 |
| 2  | 1           | 'location1'              | 15                      | 1.4                 |
| 3  | 2           | 'location1'              | 20                      | 1.8                 |

Pipeline entity:

| id | name        |
|----|-------------|
| 1  | 'Pipeline1' |
| 2  | 'Pipeline2' |
|    |             |

Desired Query Result:

| id | name        | associated_wells         | days_without_production | production_water_m3 |
|----|-------------|--------------------------|-------------------------|---------------------|
| 1  | 'Pipeline1' | 'location1', 'location2' | 15                      | 2.3                 |
| 2  | 'Pipeline2' | 'location1'              | 20                      | 1.8                 |
|    |             |                          |                         |                     |

Upvotes: 1

Views: 1602

Answers (2)

marcothesane
marcothesane

Reputation: 6721

Keep normalising and verticalise the locations/associated wells, by cross joining with a series of integers, and then group twice:

WITH
segment(seg_id,pipeline_id,associated_wells,days_without_production,production_water_m3) AS (
          SELECT 1,1,'location1, location2',30,2.3
UNION ALL SELECT 2,1,'location1',15,1.4
UNION ALL SELECT 3,2,'location1',20,1.8
)
,
pipeline(pipeline_id,name) AS (
          SELECT 1,'Pipeline1'
UNION ALL SELECT 2,'Pipeline2'
)
,
i(i) AS (
          SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
)
,
location AS (
  SELECT
    seg_id
  , i AS loc_id
  , SPLIT_PART(associated_wells,', ',i) AS location
  FROM segment CROSS JOIN i
  WHERE SPLIT_PART(associated_wells,',' ,i) <>''
)
,
pregroup AS (
  SELECT
    segment.pipeline_id
  , location.location
  , MIN(days_without_production) AS days_without_production
  , MAX(production_water_m3)     AS production_water_m3
  FROM segment
  JOIN pipeline USING(pipeline_id)
  JOIN location USING(seg_id)
  GROUP BY 1,2 
)
SELECT
  pipeline_id
, STRING_AGG(location,',')     AS locations
, MIN(days_without_production) AS days_without_production
, MAX(production_water_m3)     AS production_water_m3
FROM pregroup
GROUP BY 1;
 pipeline_id |      locations      | days_without_production | production_water_m3 
-------------+---------------------+-------------------------+---------------------
           1 | location1,location2 |                      15 |                 2.3
           2 | location1           |                      20 |                 1.8

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269443

If I understand correctly, you want DISTINCT ON:

SELECT DISTINCT ON (p.id) p.*, s.*
FROM pipelines p JOIN
     segments s
     ON s.pipeline_id = p.id
ORDER BY p.id, LENGTH(s.associated_wells) DESC;

Upvotes: 1

Related Questions