Reputation: 31

Drop all observations by ID where conditions are not met

I have a dataset with ~4 million transactional records, grouped by Customer_No (consisting of 1 or more transactions per Customer_No, denoted by a sequential counter). Each transaction has a Type code and I am only interested in customers where a particular combination of transaction Types were used. Neither joining the table on itself or using EXISTS in Proc Sql is allowing me to efficiently evaluate the transaction Type criteria. I suspect a data step using retain and do-loops would process the dataset faster

The dataset:

Customer_No Tran_Seq    Tran_Type
    0001        1           05
    0001        2           12
    0002        1           07
    0002        2           86
    0002        3           04
    0003        1           07
    0003        2           84
    0003        3           84
    0003        4           84

The criteria I am trying to apply:

All Customer_No's Tran_Type's must only be in ('04','05','07','84','86'), drop all transactions for that Customer_No if any other Tran_Type was used
Customer_No's Tran_Type's must include ('84' or '86') AND '04', drop all transactions for the Customer_No if this condition is not met

The output I want:

Customer_No Tran_Seq    Tran_Type
0002        1           07
0002        2           86
0002        3           04

Upvotes: 3

Answers (4)

Joe

Reputation: 63434

The DoW loop solution should be the most efficient if the data is sorted. If it's not sorted, it will either be the most efficient or similar in scale but slightly less efficient depending on the circumstances of the dataset.

I compared to Dom's solution with a 3e7 ID dataset, and got for the DoW a similar (slightly less) total length with less CPU for unsorted dataset, and about 50% faster for sorted. It is guaranteed to run in about the length of time the dataset takes to write out (maybe a bit more, but it shouldn't be much), plus sorting time if needed.

data want;
  do _n_=1 by 1 until (last.customer_no);
      set have;
      by customer_no;  
      if tran_type in ('84','86') 
        then has_8486 = 1;
      else if tran_type in ('04') 
        then has_04 = 1;
      else if not (tran_type in ('04','05','07','84','86')) 
        then has_other = 1;
  end;
  do _n_= 1 by 1 until (last.customer_no);
    set have;
    by customer_no;
    if has_8486 and has_04 and not has_other then output;
  end;
run;

Upvotes: 2

DWal

Reputation: 2762

I don't think it's that complicated. Join to a subquery, group by Customer_No, and put your conditions in a having clause. A condition in a min function must be true for all rows, whereas a condition in a max function must be true for any one row:

proc sql;
create table want as
select
  h.*
from
  have h
  inner join (
    select
      Customer_No
    from
      have
    group by
      Customer_No
    having
      min(Tran_Type in('04','05','07','84','86')) and
      max(Tran_Type in('84','86')) and
      max(Tran_Type eq '04')) h2
  on h.Customer_No = h2.Customer_No
;
quit;

Upvotes: 1

DomPazz

Reputation: 12465

I would offer a slightly less complex SQL solution than @naed555 using the INTERSECT operator.

proc sql noprint;

create table to_keep as
(
    select distinct customer_no
    from have 
    where tran_type in ('84','86')

    INTERSECT

    select distinct customer_no
    from have 
    where tran_type in ('04')
)

EXCEPT

    select distinct customer_no
    from have
    where tran_type not in ('04','05','07','84','86')
;

create table want as
select a.*
from have as a
inner join 
     to_keep as b
on a.customer_no = b.customer_no;

quit;

Upvotes: 0

naed555

Reputation: 31

I must have made a join error. On re-writing, Proc Sql completed in less than 30 seconds (on the original 4.9 million record dataset). It's not particularly elegant code though, so I'd still appreciate any improvements or alternative methods.

data Have;
input Customer_No $ Tran_Seq $ Tran_Type:$2.;
cards;
    0001        1           05
    0001        2           12
    0002        1           07
    0002        2           86
    0002        3           04
    0003        1           07
    0003        2           84
    0003        3           84
    0003        4           84
;
run;

Proc sql;
Create table Want as
select t1.* from Have t1
LEFT JOIN (select DISTINCT Customer_No from Have
                    where Tran_Type not in ('04','05','07','84','86')
                                  ) t2
ON(t1.Customer_No=t2.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
                    where Tran_Type in ('84','86')
                                ) t3
ON(t1.Customer_No=t3.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
                    where Tran_Type in ('04')
                                ) t4
ON(t1.Customer_No=t4.Customer_No)
Where t2.Customer_No is null
;Quit;

Upvotes: 0

Drop all observations by ID where conditions are not met

Answers (4)

Related Questions