Andrew
Andrew

Reputation: 7798

Alternative way of joining two datasets in SAS

I have two datasets DS1 and DS2. DS1 is 100,000rows x 40cols, DS2 is 20,000rows x 20cols. I actually need to pull COL1 from DS1 if some fields match DS2. Since I am very-very new to SAS, I am trying to stick to SQL logic.

So basically I did (shot version)

proc sql;
...
SELECT DS1.col1 
FROM DS1 INNER JOIN DS2 
on DS1.COL2=DS2.COL3 
OR DS1.COL3=DS2.COL3 
OR DS1.COL4=DS2.COL2
...

After an hour or so, it was still running, but I was getting emails from SAS that I am using 700gb or so. Is there a better and faster SAS-way of doing this operation?

Upvotes: 2

Views: 377

Answers (3)

Sundar Krishnan
Sundar Krishnan

Reputation: 43

First sort the datasets that you are trying to merge using proc sort. Then merge the datasets based on id.

Here is how you can do it. I have assumed you match field as ID

proc sort data=DS1;
by ID;
proc sort data=DS2;
by ID;

data out;
merge DS1 DS2;
by ID;
run;

You can use proc sort for Ds3 and DS4 and then include them in merge statement if you need to join them as well.

Upvotes: 0

Robert Penridge
Robert Penridge

Reputation: 8523

You may have null or blank values in the columns you are joining on. Your query is probably matching all the null/blank columns together resulting in a very large result set.

I suggest adding additional clauses to exclude null results.

Also - if the same row happens to exist in both tables, then you should also prevent the row from joining to itself.

Either of these could effectively result in a cartesian product join (or something close to a cartesian product join).

EDIT : By the way - a good way of debugging this type of problem is to limit both datasets to a certain number of rows - say 100 in each - and then running it and checking the output to make sure it's expected. You can do this using the SQL options inobs=, outobs=, and loops=. Here's a link to the documentation.

Upvotes: 3

DomPazz
DomPazz

Reputation: 12465

I would use 3 separate queries and use a UNION

proc sql;
...
SELECT DS1.col1 
FROM DS1 INNER JOIN DS2 
on DS1.COL2=DS2.COL3 

UNION

SELECT DS1.col1 
FROM DS1 INNER JOIN DS2 
On DS1.COL3=DS2.COL3 

UNION

SELECT DS1.col1 
FROM DS1 INNER JOIN DS2 
ON DS1.COL4=DS2.COL2
...

Upvotes: 3

Related Questions