Reputation: 1177
I have a data.frame with 2 columns ("X" and "Y") that looks like this:
X Y
1_SNP_3 4
2_SNP_6 3
3_SNP_1 4
20_SNP_7 7
7_SNP_20 7
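To make the example reproducible, the table above can be rebuilt roughly as follows. This is only a sketch: the name `df` and the column types (character for X, numeric for Y) are assumptions on my part.

```r
# Rebuild the example data shown above (object name `df` is an assumption)
df <- data.frame(
  X = c("1_SNP_3", "2_SNP_6", "3_SNP_1", "20_SNP_7", "7_SNP_20"),
  Y = c(4, 3, 4, 7, 7),
  stringsAsFactors = FALSE  # keep X as character so string functions apply directly
)
```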
Using grepl or a similar function in R, I would like to compare all elements (strings) in X. Each string has a number at the beginning and at the end, and all strings share a common substring in between ("_SNP_"). I would like to remove only those rows whose string, when its two numbers are swapped (e.g. from 1_SNP_3 to 3_SNP_1), duplicates another string in X.
For example, if the numbers in "1_SNP_3" are inverted, the string "3_SNP_1" results, which already exists, so one of these two strings (and its corresponding row) gets removed.
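To illustrate what I mean by inverting, here is a rough sketch of the swap on its own. The vector name `x` and the helper names are made up for illustration, and the pattern assumes every string has exactly the form number, "_SNP_", number. It only flags strings whose mirror image is present; it flags both members of each pair, so it does not yet decide which one to drop.

```r
x <- c("1_SNP_3", "2_SNP_6", "3_SNP_1", "20_SNP_7", "7_SNP_20")

# Swap the leading and trailing numbers using two capture groups
inverted <- sub("^(\\d+)_SNP_(\\d+)$", "\\2_SNP_\\1", x)

# TRUE where the inverted string is itself present in x
has_mirror <- inverted %in% x
```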
I would like to get this:
X Y
1_SNP_3 4
2_SNP_6 3
20_SNP_7 7
Upvotes: 0
Views: 91
Reputation: 51592
Here is a solution using base R.
df[!duplicated(sapply(strsplit(gsub('\\D+', ' ', df$X), ' '), function(i) toString(sort(i)))),]
# X Y
#1 1_SNP_3 4
#2 2_SNP_6 3
#4 20_SNP_7 7
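Since the one-liner is fairly dense, here is the same idea unpacked step by step. This is a sketch under the assumption that `df` holds the data from the question with X stored as character; the intermediate names (`digits_only`, `parts`, `key`, `result`) are only for illustration.

```r
df <- data.frame(
  X = c("1_SNP_3", "2_SNP_6", "3_SNP_1", "20_SNP_7", "7_SNP_20"),
  Y = c(4, 3, 4, 7, 7),
  stringsAsFactors = FALSE
)

# 1. Replace every run of non-digits with a single space: "1_SNP_3" -> "1 3"
digits_only <- gsub("\\D+", " ", df$X)

# 2. Split on the space to get the two numbers of each row (as strings)
parts <- strsplit(digits_only, " ")

# 3. Sort each pair so that mirrored strings produce the same key
key <- sapply(parts, function(i) toString(sort(i)))

# 4. Keep only the first row seen for each key
result <- df[!duplicated(key), ]
```

Note that the numbers are sorted as strings here, not as integers. That is enough for deduplication, because both members of a mirrored pair still end up with the same key.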
Upvotes: 3
Reputation: 105
# My first answer submission - a data.table solution
library(data.table)
# Create the table
DT <- data.table(X = c("1_SNP_3","2_SNP_6","3_SNP_1","20_SNP_7","7_SNP_20"),
                 Y = c(4,3,4,7,7))
DT
# Extract the first and last numbers as integers, so they compare numerically
DT[, ':=' (B = as.integer(gsub("_.*","",X)),
           E = as.integer(gsub(".*_SNP_","",X)))]
# Order the new columns so B is never greater than E
DT[, c("B", "E") := .(pmin(B, E), pmax(B, E))]
# Keep only the first instance of each (B, E) pair, i.e. drop duplicates
DT <- DT[, .SD[1], by=c("B","E")]
# Delete the helper columns
DT[, c("B","E") := NULL]
DT
Output:
X Y
1: 1_SNP_3 4
2: 2_SNP_6 3
3: 20_SNP_7 7
Upvotes: 2