Reputation: 719
I have a dataset that looks like:
ColA ColB ColC ColD ColE
rs778 C Can + C/T
rs778 C Pro + C/T
rs779 P Can + A/G
rs779 P Can - A/G
I want to remove duplicate entries in Column A based on column C. Said another way, if two entries in Column A are the same, I want the row that stays to be determined by the entry in Column C. If the entries in Column C are the same, then the row that stays should be determined by Column D. If "Can" > "Pro" and "+" > "-", then the final output I'm looking for would look like this:
ColA ColB ColC ColD ColE
rs778 C Can + C/T
rs779 P Can + A/G
I removed completely duplicated data using:
data2 <- data[!duplicated(data[-2]),]
And am hoping my solution lies in some modification of this I have yet to discover. Thanks for your help!
Upvotes: 1
Views: 279
Reputation: 14667
Here is a solution that accomplishes what you need, but may not be the most graceful method possible.
data = read.table(header=TRUE, stringsAsFactors=FALSE,
text="ColA ColB ColC ColD ColE
rs778 C Can + C/T
rs778 C Pro + C/T
rs779 P Can + A/G
rs779 P Can - A/G")
# Convert ColC and ColD to factors, controlling sort order with levels arg.
# "Can" will sort before "Pro", and "+" will sort before "-".
data$ColC = factor(data$ColC, levels=c("Can", "Pro"))
data$ColD = factor(data$ColD, levels=c("+", "-"))
# Sort rows.
data = data[order(data$ColA, data$ColC, data$ColD), ]
# Works because prefered ColA duplicate sorts highest.
data2 = data[!duplicated(data$ColA), ]
data2
# ColA ColB ColC ColD ColE
# 1 rs778 C Can + C/T
# 3 rs779 P Can + A/G
Upvotes: 4