How to remove column duplicates according to entry in another column (within a dataset)

Question

I have a dataset that looks like:

    ColA  ColB   ColC  ColD  ColE
    rs778   C   Can     +   C/T
    rs778   C   Pro     +   C/T
    rs779   P   Can     +   A/G
    rs779   P   Can     -   A/G

I want to remove duplicate entries in Column A based on column C. Said another way, if two entries in Column A are the same, I want the row that stays to be determined by the entry in Column C. If the entries in Column C are the same, then the row that stays should be determined by Column D. If "Can" > "Pro" and "+" > "-", then the final output I'm looking for would look like this:

    ColA   ColB ColC   ColD ColE
    rs778   C   Can     +   C/T
    rs779   P   Can     +   A/G

I removed completely duplicated data using:

data2 <- data[!duplicated(data[-2]),]

And am hoping my solution lies in some modification of this I have yet to discover. Thanks for your help!

bdemarest · Accepted Answer

Here is a solution that accomplishes what you need, but may not be the most graceful method possible.

data = read.table(header=TRUE, stringsAsFactors=FALSE,
                  text="ColA  ColB   ColC  ColD  ColE
                        rs778   C   Can     +   C/T
                        rs778   C   Pro     +   C/T
                        rs779   P   Can     +   A/G
                        rs779   P   Can     -   A/G")

# Convert ColC and ColD to factors, controlling sort order with levels arg.
# "Can" will sort before "Pro", and "+" will sort before "-".
data$ColC = factor(data$ColC, levels=c("Can", "Pro"))
data$ColD = factor(data$ColD, levels=c("+", "-"))

# Sort rows.
data = data[order(data$ColA, data$ColC, data$ColD), ]

# Works because prefered ColA duplicate sorts highest.
data2 = data[!duplicated(data$ColA), ]

data2
#    ColA ColB ColC ColD ColE
# 1 rs778    C  Can    +  C/T
# 3 rs779    P  Can    +  A/G

How to remove column duplicates according to entry in another column (within a dataset)

Answers (1)

Related Questions