julia-sets
julia-sets

Reputation: 1

Is there a way to select rows in SAS that contain strings based on values from another dataset?

So I have two datasets. Dataset1 has a column "tags" that contains numbers separated by +. It looks like this:

id   complete   tags
1    true       11+27+5672+26+108+14
2    false      27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3    true       13+27+5672+2675+26+238895+23+16
4    true       10+27+5672+163333+26+25+43869595+234149+18567+19390192+13675649+23+16
5    true       10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16
.
.

My second dataset has a list of "tags" like so:

id       name           count
+968+    TagName968     39421
+2675+   TagName2675    7450
.
.

What I would like to do is to go through and select all of the rows from Dataset1 where any of the tags in the tags column match any of the id rows in Dataset2. So basically any rows in Dataset1 that contain "+968+" or "+2675+" or any of the other ~3000 ids (so I can't hardcode them all in). So if I ran it based on the example above, rows 2, 3, and 5 would all be selected because they contain "+2675+".

I have tried different versions of DATA steps and PROC SQL, but I'm not sure I'm doing them right. The problem with PROC SQL is that I'm not really looking for a join, I don't think? Or at least it would have to be one-to-many.

I was thinking I might need to separate out the "tags" column so there's one tag per column, but the problem is that each row can have a wildly different number of tags (many of them have a lot more than given in the example) so I'm not sure that would be the best choice.

Upvotes: 0

Views: 409

Answers (1)

Kermit
Kermit

Reputation: 3117

Consider switching from wide to long format using a simple do loop and the scan function.

data have;
    input id complete $ tags :$2000.;
    cards;
1    true       11+27+5672+26+108+14
2    false      27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3    true       13+27+5672+2675+26+238895+23+16
4    true       10+27+5672+163333+26+25+43869595+234149+18567+19390192+13675649+23+16
5    true       10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16
;

data mapping;
    input id $ name $ count;
    cards;
+968+    TagName968     39421
+2675+   TagName2675    7450
;
data tags;
    set have;
    do i=1 to countw(tags, '+');
        tag=input(scan(tags, i, '+'), 8.);
        output;
    end;
    drop i;
run;

proc sql;
    create table want as select distinct id, complete, tags 
    from tags 
    where tag in (select input(compress(id, '+'), 8.) as id from mapping);
quit;
id complete                         tags
2    false   27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3    true    13+27+5672+2675+26+238895+23+16
5    true    10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16

Upvotes: 1

Related Questions