CadisEtRama
CadisEtRama

Reputation: 1111

extracting unique combination rows from a data frame in R

I have a data frame that gives pairwise correlations of scores people in the same state had provided. I am giving a small example of what I wish to do with this data, but right now my actual data set has 15 million rows for pairwise correlations and many more additional columns.

Below is the example data:

>sample_data

Pair_1ID    Pair_2ID    CORR    
1           2           0.12    
1           3           0.23    
2           1           0.12    
2           3           0.75    
3           1           0.23    
3           2           0.75    

I want to generate a new data frame without duplicates, for example in row 1, the correlation between persons 1 and 2 is 0.12. Row 1 is the same as Row 3, which shows the correlation between 2 and 1. Since they have the same information I would like a final file without duplicates, I would like a file like the one below:

>output 


Pair_1ID    Pair_2ID    CORR
    1        2          0.12
    1        3          0.23
    2        3          0.75

Can someone help? The unique command wont work with this and I don't know how to do it.

Upvotes: 2

Views: 1307

Answers (2)

Sven Hohenstein
Sven Hohenstein

Reputation: 81693

The answer of flodel is really excellent. I just want to add another solution based on indexing without looking at the actual values. It only works if all combinations are present and the data frame is ordered by column 1 in the first place and column 2 in the second place (like in the example).

maxVal <- max(sample_data$Pair_1ID)
shrtIdx <- logical(maxVal)
idx <- sapply(seq(maxVal - 1, 1), function(x) replace(shrtIdx, seq(x), TRUE))
sample_data[idx,]

#   Pair_1ID Pair_2ID CORR
# 1        1        2 0.12
# 2        1        3 0.23
# 4        2        3 0.75

Upvotes: 0

flodel
flodel

Reputation: 89057

Assuming every combination shows up twice:

subset(sample_data , Pair_1ID <= Pair_2ID)

If not:

unique(transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                              Pair_2ID = pmax(Pair_1ID, Pair_2ID)))

Edit: regarding that last one, including CORR in the unique is not a great idea because of possible floating point issues. I also see you mention you have a lot more columns. So it would be better to limit the comparison to the two ids:

relabeled <- transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                                    Pair_2ID = pmax(Pair_1ID, Pair_2ID))
subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))

Upvotes: 10

Related Questions