Reputation: 712
I have a dataset in which I need to look at all pairs of items that are together from within another group. I've created a toy example below to further explain.
BUNCH FRUITS
1 apples
1 bananas
1 mangos
2 apples
3 bananas
3 apples
4 bananas
4 apples
What I want is a listing of all possible pairs and sum the frequency they occur together within a bunch. My output would ideally look like this:
FRUIT1 FRUIT2 FREQUENCY
APPLES BANANAS 3
APPLES MANGOS 1
My end goal is to make something that I'll eventually be able to import into Gephi for a network analysis. For this I need a Source and Target column (aka FRUIT1 and FRUIT2 above).
I think there are a few other ways to approach this as well without using PROC SQL (Maybe using PROC TRANSPOSE) but this is where I've started.
SOLUTION
Thanks for the help. Sample code below for anyone interested in something similar:
proc sql;
create table fruit_combo as
select a.FRUIT as FRUIT1, b.FRUIT as FRUIT2, count(*) as FREQUENCY
from FRUITS a, FRUITS b
where a.BUNCH=b.BUNCH and and not a.FRUIT= b.FRUIT
group by FRUIT1, FRUIT2;
quit;
Upvotes: 6
Views: 511
Reputation: 507
Here's the copy/paste version of above. A simple reading shows errors - duplicate rows of counts for banana-apple and apple-banana. To get to the desired result an additional restriction was required (a.FRUIT gt b.FRUIT).
data FRUITS ;
input BUNCH FRUIT $;
cards;
1 apples
1 bananas
1 mangos
2 apples
3 bananas
3 apples
4 bananas
4 apples
;
run;
proc freq data=have ;
tables fruits;
run;
proc sql;
create table fruit_combo as
select a.FRUIT as FRUIT1, b.FRUIT as FRUIT2, count(*) as FREQUENCY
from FRUITS a, FRUITS b
where a.BUNCH=b.BUNCH
and a.FRUIT ne b.FRUIT
and a.FRUIT gt b.FRUIT
group by FRUIT1, FRUIT2;
quit;
proc print ; run;
Upvotes: 0
Reputation: 63434
Simplest approach is to do a cartesian (full) join of the table to itself, on t1.ID=t2.ID and t1.FRUIT ne t2.FRUIT. That will generate the full combination set, which you could then summarize.
Upvotes: 4