stackq
stackq

Reputation: 491

Excluding specific rows based on row number

I have a dataset which looks like this:

| ReportId  | Method | Status | OrganizationId | StartedAt                     |
|-----------|--------|--------|----------------|-------------------------------|
| 38373bfk8 | Email  | 0      | ABC            | 2022-06-10 00:00:53.794 +0000 |
| 78687fea  | Email  | 0      | XYZ            | 2022-06-10 00:03:51.432 +0000 |
| 48978kd   | Email  | 100    | POD            | 2022-06-10 00:02:45.532 +0000 |
| 38373bfk8 | Email  | 100    | ABC            | 2022-06-10 00:00:22.654 +0000 |
| 86887dhd  | Csv    | 100    | FGH            | 2022-06-10 00:03:12.541 +0000 |
| 78687fea  | Email  | 100    | XYZ            | 2022-06-11 00:04:51.352 +0000 |

In the Status column, 0 indicates a failure and 100 indicates a success for the generation of a Report for an Org.

I want to implement logic that if a certain ReportId/Method/OrganizationId combination fails but the same combination has a successful status later in the same day, then exclude that initial failure. I essentially want to keep all successful rows and all failure rows that didn't have a success later in the same day.

From the above dataset, we would remove the first row as there is a success for that ReportId/Method/Status combination later in the day (row 4). We would keep row two despite it also being a failure, because there is no successful status later in the same day (row 6 is the next day). So all rows would be kept except the first.

I have built a CTE to rank the rows based accordingly:

with Ranked as (
    select
        ReportId,
        Method,
        Status,
        OrganizationId,
        StartedAt,
        row_number() over (partition by ReportId, Method, OrganizationId, cast(StartedAt as date) order by StartedAt asc) as rn
    from
        MyTable
)

Then I filter out the rows I don't want based on the above logic:

ExcludeFirstFailures as (
    select
        ReportId,
        Method,
        Status,
        OrganizationId,
        StartedAt,
        rn
    from
        Ranked
    where
        (Status in 0 and rn >  1) --Keep failures that weren't the first of the day
        or Status = 100 --Keep all successful rows
)

This is close to what I need but the problem is that it is filtering out all failures which were the first of the day, without making a comparison if the same report ran ran successfully later on the same day - only those failures should be excluded.

Upvotes: 0

Views: 330

Answers (2)

Gokhan Atil
Gokhan Atil

Reputation: 10079

Can you try this one?

with successful as (
  select
      ReportId,
      Method,
      Status,
      OrganizationId,
      StartedAt,
      MAX(Status) over (partition by ReportId, Method, OrganizationId, cast(StartedAt as date)) as success
  from
      MyTable
  )
select *
from successful
where
    success = 0  -- all failure rows that didn't have a success later
    or Status = 100 --Keep all successful rows
order by STARTEDAT, REPORTID, status
;

+-----------+--------+--------+----------------+-------------------------------+---------+
| REPORTID  | METHOD | STATUS | ORGANIZATIONID |           STARTEDAT           | SUCCESS |
+-----------+--------+--------+----------------+-------------------------------+---------+
| 38373bfk8 | Email  |    100 | ABC            | 2022-06-10 00:00:22.654 +0000 |     100 |
| 48978kd   | Email  |    100 | POD            | 2022-06-10 00:02:45.532 +0000 |     100 |
| 86887dhd  | Csv    |    100 | FGH            | 2022-06-10 00:03:12.541 +0000 |     100 |
| 78687fea  | Email  |      0 | XYZ            | 2022-06-10 00:03:51.432 +0000 |       0 |
| 78687fea  | Email  |    100 | XYZ            | 2022-06-11 00:04:51.352 +0000 |     100 |
+-----------+--------+--------+----------------+-------------------------------+---------+

78687fea wasn't successful on 2022-06-10.

Extra info:

I'm informed that status code 15 is also counted as successful. If this is the complete list of successful codes, we can still use the MAX approach: Instead of using plain "status" column with MAX, we can use "IFF(Status = 15,'100', Status )" expression, and filter for "Status IN (15,100)".

To be able to test, I set the status to 15 for reports (id: 38373bfk8 and 48978kd).

The output of the modified SQL:

+-----------+--------+--------+----------------+-------------------------------+---------+
| REPORTID  | METHOD | STATUS | ORGANIZATIONID |           STARTEDAT           | SUCCESS |
+-----------+--------+--------+----------------+-------------------------------+---------+
| 38373bfk8 | Email  |     15 | ABC            | 2022-06-10 00:00:22.654 +0000 |     100 |
| 48978kd   | Email  |     15 | POD            | 2022-06-10 00:02:45.532 +0000 |     100 |
| 86887dhd  | Csv    |    100 | FGH            | 2022-06-10 00:03:12.541 +0000 |     100 |
| 78687fea  | Email  |      0 | XYZ            | 2022-06-10 00:03:51.432 +0000 |       0 |
| 78687fea  | Email  |    100 | XYZ            | 2022-06-11 00:04:51.352 +0000 |     100 |
+-----------+--------+--------+----------------+-------------------------------+---------+

Upvotes: 1

Eric Lin
Eric Lin

Reputation: 1520

What about this one?

with report_rn as (
    select
        ReportId,
        Method,
        Status,
        OrganizationId,
        StartedAt,
        row_number() over (
            partition by ReportId, Method, OrganizationId, cast(StartedAt as date) 
            order by status desc, StartedAt desc
        ) as rn
    from
        report
    qualify rn = 1
)
select
    ReportId,
    Method,
    Status,
    OrganizationId,
    StartedAt
from report_rn 
order by StartedAt;

+-----------+--------+--------+----------------+-------------------------------+
| REPORTID  | METHOD | STATUS | ORGANIZATIONID | STARTEDAT                     |
|-----------+--------+--------+----------------+-------------------------------|
| 38373bfk8 | Email  |    100 | ABC            | 2022-06-10 00:00:22.654000000 |
| 48978kd   | Email  |    100 | POD            | 2022-06-10 00:02:45.532000000 |
| 86887dhd  | Csv    |    100 | FGH            | 2022-06-10 00:03:12.541000000 |
| 78687fea  | Email  |      0 | XYZ            | 2022-06-10 00:03:51.432000000 |
| 78687fea  | Email  |    100 | XYZ            | 2022-06-11 00:04:51.352000000 |
+-----------+--------+--------+----------------+-------------------------------+

If it ran only once, either successful or failed will have RN = 1. If there is a successful one later on on the same day, because we order by status desc, status 100 will be on the top, so it will be selected.

If there are multiple failures on the same day, then the latest failed one will be returned because we again order by StartedAt desc as well.

Upvotes: 0

Related Questions