Reputation: 1
So I have two datasets. Dataset1 has a column "tags" that contains numbers separated by +. It looks like this:
id complete tags
1 true 11+27+5672+26+108+14
2 false 27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3 true 13+27+5672+2675+26+238895+23+16
4 true 10+27+5672+163333+26+25+43869595+234149+18567+19390192+13675649+23+16
5 true 10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16
.
.
My second dataset has a list of "tags" like so:
id name count
+968+ TagName968 39421
+2675+ TagName2675 7450
.
.
What I would like to do is to go through and select all of the rows from Dataset1 where any of the tags in the tags column match any of the id rows in Dataset2. So basically any rows in Dataset1 that contain "+968+" or "+2675+" or any of the other ~3000 ids (so I can't hardcode them all in). So if I ran it based on the example above, rows 2, 3, and 5 would all be selected because they contain "+2675+".
I have tried different versions of DATA steps and PROC SQL, but I'm not sure I'm doing them right. The problem with PROC SQL is that I'm not really looking for a join, I don't think? Or at least it would have to be one-to-many.
I was thinking I might need to separate out the "tags" column so there's one tag per column, but the problem is that each row can have a wildly different number of tags (many of them have a lot more than given in the example) so I'm not sure that would be the best choice.
Upvotes: 0
Views: 409
Reputation: 3117
Consider switching from wide to long format using a simple do loop and the scan function.
data have;
input id complete $ tags :$2000.;
cards;
1 true 11+27+5672+26+108+14
2 false 27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3 true 13+27+5672+2675+26+238895+23+16
4 true 10+27+5672+163333+26+25+43869595+234149+18567+19390192+13675649+23+16
5 true 10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16
;
data mapping;
input id $ name $ count;
cards;
+968+ TagName968 39421
+2675+ TagName2675 7450
;
data tags;
set have;
do i=1 to countw(tags, '+');
tag=input(scan(tags, i, '+'), 8.);
output;
end;
drop i;
run;
proc sql;
create table want as select distinct id, complete, tags
from tags
where tag in (select input(compress(id, '+'), 8.) as id from mapping);
quit;
id complete tags
2 false 27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3 true 13+27+5672+2675+26+238895+23+16
5 true 10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16
Upvotes: 1