Extracting unique combination rows from a data frame in R

Question

I have a data frame of pairwise correlations between the scores given by people in the same state. The example below is small, but my actual data set has 15 million rows of pairwise correlations and many additional columns.

Below is the example data:

>sample_data

Pair_1ID    Pair_2ID    CORR    
1           2           0.12    
1           3           0.23    
2           1           0.12    
2           3           0.75    
3           1           0.23    
3           2           0.75

I want to generate a new data frame without duplicates. For example, row 1 gives the correlation between persons 1 and 2 as 0.12. Row 3 gives the correlation between persons 2 and 1, which is the same information. I would like a final data frame without such duplicates, like the one below:

>output

Pair_1ID    Pair_2ID    CORR
1           2           0.12
1           3           0.23
2           3           0.75

Can someone help? The unique command won't work with this, and I don't know how to do it.

flodel · Accepted Answer

Assuming every combination shows up twice:

subset(sample_data , Pair_1ID <= Pair_2ID)
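Whether that assumption holds can be checked before relying on the one-liner. The following is only a sketch: it rebuilds the example data from the question and tests that no ordered pair is repeated and that every pair also appears in the reverse order. The names forward, backward and every_pair_twice are introduced here for illustration and are not part of the original answer.

```r
# Example data copied from the question.
sample_data <- data.frame(
  Pair_1ID = c(1, 1, 2, 2, 3, 3),
  Pair_2ID = c(2, 3, 1, 3, 1, 2),
  CORR     = c(0.12, 0.23, 0.12, 0.75, 0.23, 0.75)
)

# Build one key per row in each direction (helper names are illustrative).
forward  <- paste(sample_data$Pair_1ID, sample_data$Pair_2ID)
backward <- paste(sample_data$Pair_2ID, sample_data$Pair_1ID)

# TRUE only if no ordered pair is repeated and every pair has its mirror.
every_pair_twice <- anyDuplicated(forward) == 0 && all(backward %in% forward)
every_pair_twice

# When the check passes, keeping one orientation is enough.
subset(sample_data, Pair_1ID <= Pair_2ID)
```

For the example data the check passes, so the subset keeps one row per pair.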

If not:

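# transform() evaluates both expressions against the original columns,
# so pmax() still sees the untouched Pair_1ID.
unique(transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                              Pair_2ID = pmax(Pair_1ID, Pair_2ID)))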

Edit: regarding that last one, including CORR in the call to unique is not a great idea because of possible floating-point issues. I also see you mention that you have many more columns, so it would be better to limit the comparison to the two IDs:

relabeled <- transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                                    Pair_2ID = pmax(Pair_1ID, Pair_2ID))
subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))
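As a rough end-to-end illustration, here is the same approach run on a variant of the question's data in which one mirrored row is missing. The variant is an assumption made for this sketch (the pair 2/3 appears only as 3/2), and the name result is also introduced here.

```r
# Variant of the question's data: the row (2, 3) has been left out,
# so the pair 2/3 is present only as (3, 2). This is illustrative data.
sample_data <- data.frame(
  Pair_1ID = c(1, 1, 2, 3, 3),
  Pair_2ID = c(2, 3, 1, 1, 2),
  CORR     = c(0.12, 0.23, 0.12, 0.23, 0.75)
)

# Put the smaller ID first and the larger ID second in every row.
relabeled <- transform(sample_data,
                       Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                       Pair_2ID = pmax(Pair_1ID, Pair_2ID))

# Keep the first occurrence of each ID pair; CORR and any other
# columns are carried along but do not take part in the comparison.
result <- subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))
result
```

This keeps rows 1, 2 and 5 of the variant data, one for each of the pairs 1/2, 1/3 and 2/3, even though 2/3 was only present in reversed order.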
