JW2
JW2

Reputation: 349

Comparing two variables in different datasets

I have two datasets:

set.one & set.two

var1 in set.one contains phrases, some of those containing states and cities

set.two contains a list of US States.

Using SAS or PROC SQL what is the most efficient way of comparing the two, with the intent of deleting the phrases that contain US States?

thank you for your suggestions.

Upvotes: 1

Views: 1293

Answers (2)

user1965813
user1965813

Reputation: 671

If set.one is not too big, you could go for an sql-outer join and return distinct values that have no match. Like so:

PROC SQL;
CREATE TABLE phrases_without_states AS
SELECT DISTINCT A.phrase
FROM set.one A OUTER JOIN set.two B
WHERE A.phrase NOT LIKE '%' + B.state + '%'
;
quit;

I am a bit unsure about the comparison though. Does anyone know how one uses like with other variables in stead of fixed strings?

Upvotes: 0

Joe
Joe

Reputation: 63434

I would say that your best bet is to create a format with set two, or a SQL join, depending on the details. Something like:

data one;
format var1 $50.;
infile datalines truncover;
input @1 var1 $50.;
datalines;
Arizona
Kalamazoo
California
New Mexico
Las Cruces, New Mexico
California Pizza Kitchen
Cheese
;;;;
run;
data two;
format state $20.;
infile datalines truncover;
input @1 state $20.;
datalines;
Alabama
Arizona
Alaska
Colorado
New Mexico
California
Missouri
;;;;
run;

data for_fmt;
set two;
start=state;
label='STATE';
retain fmtname "$statef";
output;
if _n_ = 1 then do;
start=' ';
label='FALSE';
hlo='o';
output;
end;
run;

proc format cntlin=for_fmt;
quit
;

data want1;
set one;
if put(var1,$statef.)='STATE' then delete;
run;

data want2;
set one;
do __t = 1 to countc(',',var1)+1;
  x=strip(scan(var1,__t,','));
  if put(x,$STATEF.)='STATE' then delete;
end;
run;

proc sql;
  create table want3 as select * from one where not exists (
    select 1 from two where find(strip(one.var1),strip(two.state)) > 0
    );
quit;

WANT1 is simplest - requires the entire VAR1 to be a state. Also, fastest. WANT2 is slightly more complicated; requires either all of VAR1 to be a state, or to be comma delimited (like city,state). WANT3 allows most complex matches (anywhere, including something like "GeorgiaPeachClub" would match, without any spaces at all). It is also, however, very slow depending on the size of the dataset (it is a cartesian join).

If you have a very big dataset there are other solutions that are probably faster; perhaps a hash table in a SAS dataset, or even iterating over the two dataset inside the one dataset, or perhaps using an indexed search instead of a straight up search. But you don't have very much information in your question. More detail, like example data from your tables and the size of the tables, would be helpful.

Upvotes: 1

Related Questions