Reputation: 1111
I have a data frame of pairwise correlations between the scores provided by people in the same state. I am giving a small example of what I want to do with this data; my actual data set has 15 million rows of pairwise correlations and many additional columns.
Below is the example data:
>sample_data
Pair_1ID Pair_2ID CORR
1 2 0.12
1 3 0.23
2 1 0.12
2 3 0.75
3 1 0.23
3 2 0.75
I want to generate a new data frame without duplicates. For example, row 1 gives the correlation between persons 1 and 2 as 0.12, and row 3 gives the same correlation between persons 2 and 1. Since both rows carry the same information, I would like a final file that keeps only one of them, like the one below:
>output
Pair_1ID Pair_2ID CORR
1 2 0.12
1 3 0.23
2 3 0.75
Can someone help? The unique command won't work here, and I don't know how else to do it.
Upvotes: 2
Views: 1307
Reputation: 81693
flodel's answer is excellent. I just want to add another solution that relies on row positions alone, without looking at the actual values. It only works if all combinations are present and the data frame is sorted by column 1 first and by column 2 second (as in the example).
maxVal <- max(sample_data$Pair_1ID)
shrtIdx <- logical(maxVal)
idx <- sapply(seq(maxVal - 1, 1), function(x) replace(shrtIdx, seq(x), TRUE))
sample_data[idx,]
# Pair_1ID Pair_2ID CORR
# 1 1 2 0.12
# 2 1 3 0.23
# 4 2 3 0.75
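As a quick check of the ordering assumption, here is a minimal sketch that applies the same index to a larger case. The data frame `demo`, its four IDs, and its CORR values are made up for illustration; `as.vector()` is only there to make the flattening of the index matrix explicit.

```r
# Hypothetical data: every ordered pair of IDs 1..4,
# sorted by Pair_1ID first and Pair_2ID second
demo <- expand.grid(Pair_2ID = 1:4, Pair_1ID = 1:4)
demo <- demo[demo$Pair_1ID != demo$Pair_2ID, c("Pair_1ID", "Pair_2ID")]
rownames(demo) <- NULL

# Made-up symmetric CORR value, e.g. 0.12 for the pair (1, 2) and (2, 1)
demo$CORR <- with(demo, (10 * pmin(Pair_1ID, Pair_2ID) +
                           pmax(Pair_1ID, Pair_2ID)) / 100)

# Same index construction as above
maxVal  <- max(demo$Pair_1ID)
shrtIdx <- logical(maxVal)
idx <- sapply(seq(maxVal - 1, 1), function(x) replace(shrtIdx, seq(x), TRUE))

# Flatten the index matrix column by column and use it to pick rows
result <- demo[as.vector(idx), ]
result
```

With four IDs there are 12 rows, and the index keeps the 6 rows in which Pair_1ID is smaller than Pair_2ID. If the rows were shuffled or a pair were missing, the positions would no longer line up.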
Upvotes: 0
Reputation: 89057
Assuming every pair shows up twice (once in each order):
subset(sample_data, Pair_1ID <= Pair_2ID)
If not:
unique(transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                              Pair_2ID = pmax(Pair_1ID, Pair_2ID)))
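To illustrate the difference between the two approaches, here is a small sketch on made-up data (the data frame `one_way` is hypothetical) in which one pair is stored only once, and only in (high, low) order:

```r
# Hypothetical data: pair (2, 3) appears only once, and only as (3, 2)
one_way <- data.frame(Pair_1ID = c(1, 2, 3),
                      Pair_2ID = c(2, 1, 2),
                      CORR     = c(0.12, 0.12, 0.75))

# The subset() filter keeps only rows already in (low, high) order,
# so the (3, 2) row is dropped and the pair is lost entirely
by_filter <- subset(one_way, Pair_1ID <= Pair_2ID)

# Relabeling first puts every row in (low, high) order; unique() then
# collapses the two copies of pair (1, 2) and keeps pair (2, 3)
by_relabel <- unique(transform(one_way,
                               Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                               Pair_2ID = pmax(Pair_1ID, Pair_2ID)))

nrow(by_filter)   # 1
nrow(by_relabel)  # 2
```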
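Edit: regarding the unique(transform(...)) version, including CORR in the unique call is not a great idea because of possible floating point issues: two rows for the same pair could hold CORR values that differ by rounding error and so not be treated as duplicates. I also see you mention you have a lot more columns. So it would be better to limit the comparison to the two ids: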
relabeled <- transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                                    Pair_2ID = pmax(Pair_1ID, Pair_2ID))
subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))
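As a rough illustration of the floating point concern, the sketch below uses made-up data (the data frame `noisy` is hypothetical) where the same pair is stored twice with CORR values that print alike but are not exactly equal. Comparing on the two ids alone does not depend on CORR at all:

```r
# Hypothetical data: one pair stored in both orders, with CORR values
# that differ only by rounding error (0.1 + 0.2 is not exactly 0.3)
noisy <- data.frame(Pair_1ID = c(1, 2),
                    Pair_2ID = c(2, 1),
                    CORR     = c(0.3, 0.1 + 0.2))

noisy$CORR[1] == noisy$CORR[2]  # FALSE

# Put each pair in (low, high) order, then drop rows whose id pair
# has already been seen; CORR plays no part in the comparison
relabeled <- transform(noisy, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                              Pair_2ID = pmax(Pair_1ID, Pair_2ID))
deduped <- subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))

nrow(deduped)  # 1
```

Note that duplicated keeps the first occurrence of each pair, so the CORR value that survives is the one from whichever row comes first.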
Upvotes: 10