Reputation: 12794

SQL aggregate function for any non-specific value from a group

Is there an agregate function that returns any value from a group. I could use MIN or MAX, but would rather avoid the overhead if possible given it's a text field.

My situation is an error log summary. The errors are grouped by the type of error and an example of the error text is displayed for each group. It doesn't matter which error message is used as the example.

SELECT
    ref_code,
    log_type,
    error_number,
    COUNT(*) AS count,
    MIN(data) AS example
FROM data
GROUP BY
    ref_code,
    log_type,
    error_number

What can I replace MIN(data) with to not have to compare 100,000s of varchar(2000) values?

Upvotes: 8

Answers (3)

DWright

Reputation: 9500

Well, since you asked about OVER PARTITION AND ORDER BY, below is a version that does your GROUP BY, but then also uses ROW_NUMBER() with OVER and PARTITION AND ORDER BY, to number the first ref_code, log_type, error_num combination it comes across as row number 1 (with whatever data column is there at 1). Then it renumbers, starting at 1, at the next distinct ref_code, log_type, error_num combination it finds (with whatever data column that happens to be there). So you can then simply pull the data field at row number 1 as a representative data field for a given ref_code, log_type, error_num.

It's still lacking something. It would be more elegant if I didn't have the double pass (once for aggregation and once for row_number()); however, it might perform very well none-the-less. I'll have to think about it some more to see if I can eliminate the double pass.

But it avoids any comparison of the large data field. And it is represents a way to do what you asked: to pull 1 representative sample from the data field in correlation with the aggregated fields.

SELECT
    t.ref_code,
    t.log_type,
    t.error_number,
    t.count,
    d.data
FROM
(
    SELECT
        ref_code,
        log_type,
        error_number,
        COUNT(*) as count
    FROM data
    GROUP BY
        ref_code,
        log_type,
        error_number
) t
INNER JOIN 
(
    SELECT
        ref_code,
        log_type,
        error_number,
        data,
        ROW_NUMBER() OVER
        (
            PARTITION BY
                ref_code,
                log_type,
                error_number
            ORDER BY
                ref_code,
                log_type,
                error_number
        ) as row_number
    FROM data
) d on
    d.ref_code = t.ref_code and
    d.log_type = t.log_type and
    d.error_number = t.error_number and
    row_number = 1

Final caveat: I don't have Oracle to try this on. But I did put it together from reading Oracle documentation.

I added the below after I thought further how to elminate the GROUP BY, which I only had in there for COUNT(*). Don't know if it's any faster though.

SELECT *
FROM
(
    SELECT
        ref_code,
        log_type,
        error_number,
        data,
        ROW_NUMBER() OVER
        (
            PARTITION BY
                ref_code,
                log_type,
                error_number
            ORDER BY
                ref_code,
                log_type,
                error_number
        ) as row_number,
        COUNT(*) OVER
        (
            PARTITION BY
                ref_code,
                log_type,
                error_number
            ORDER BY
                ref_code,
                log_type,
                error_number
        ) as count 

    FROM data
) t
WHERE row_number = 1

Upvotes: 2

Hand-E-Food

Reputation: 12794

Going by the proposed answers, it appears that MIN(data) (or MAX(data)) is the fastest way to achieve what I want. I'm trying to over-optimise unnecessarily.

I'll try out any other answers that come up while I have access to this database, but in the mean time, this comes out on top.

Thank you for everyone's effort!

Upvotes: 3

cha

Reputation: 10411

you can use MIN coupled with KEEP, like this:

MIN(data) keep (dense_rank first order by rowid) AS EXAMPLE

The idea behind this is that the database engine will be sorting data over ROWID instead of the VARCHAR(2000) values, which theoretically should be faster. You can replace ROWID with the primary key value, and check if it's faster

Upvotes: 4

SQL aggregate function for any non-specific value from a group

Answers (3)

Related Questions