Lucas

Reputation: 1177

Remove rows from data.frame where the numbers on opposite sides of a shared string pattern match when swapped

I have a data.frame with 2 columns ("X" and "Y") that looks like this:

X          Y
1_SNP_3    4
2_SNP_6    3
3_SNP_1    4
20_SNP_7   7
7_SNP_20   7

Using grepl or a similar function in R, I would like to compare all elements (strings) in X. Each string has a number at the beginning and at the end, and all strings share a common substring in between ("_SNP_"). I would like to remove only those rows whose string, when its two numbers are swapped (e.g. from 1_SNP_3 to 3_SNP_1), duplicates another string in X.

E.g. if the numbers in "1_SNP_3" are swapped, the string "3_SNP_1" results, which already exists, so one of these strings (and its corresponding row) gets removed.

I would get this:

X          Y
1_SNP_3    4
2_SNP_6    3
20_SNP_7   7 

Upvotes: 0

Views: 91

Answers (2)

Sotos

Reputation: 51592

Here is a solution using base R.

df[!duplicated(sapply(strsplit(gsub('\\D+', ' ', df$X), ' '), function(i) toString(sort(i)))),]
#         X Y
#1  1_SNP_3 4
#2  2_SNP_6 3
#4 20_SNP_7 7
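
For anyone unpacking the one-liner above, here is a step-by-step sketch of the same idea. It assumes the data frame is called `df`, as in the question; the helper name `pair_key` is made up for illustration. It sorts the two numbers numerically instead of as strings, which should find the same duplicates, since any consistent ordering of the pair works.

```r
# Example data from the question
df <- data.frame(
  X = c("1_SNP_3", "2_SNP_6", "3_SNP_1", "20_SNP_7", "7_SNP_20"),
  Y = c(4, 3, 4, 7, 7),
  stringsAsFactors = FALSE
)

# pair_key is a hypothetical helper (not part of the original answer).
# It builds an order-independent key from the two numbers in each string,
# so "1_SNP_3" and "3_SNP_1" both map to "1_3".
pair_key <- function(x) {
  parts <- strsplit(x, "_SNP_", fixed = TRUE)
  vapply(parts,
         function(p) paste(sort(as.integer(p)), collapse = "_"),
         character(1))
}

keys <- pair_key(df$X)

# duplicated() flags the second and later occurrences of each key,
# so negating it keeps the first row of every pair
result <- df[!duplicated(keys), ]
result
```

Because `duplicated()` only marks later occurrences, the first row of each swapped pair is the one that survives, which matches the output requested in the question.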

Upvotes: 3

DashingQuark

Reputation: 105

# My first answer submission - a data.table solution
library(data.table)

# Create the table
DT <- data.table(X = c("1_SNP_3", "2_SNP_6", "3_SNP_1", "20_SNP_7", "7_SNP_20"),
                 Y = c(4, 3, 4, 7, 7))
DT
# Extract the first and last numbers as integers, so that they compare
# numerically rather than as strings (as strings, "20" < "7")
DT[, ':=' (B = as.integer(gsub("_.*", "", X)),
           E = as.integer(gsub(".*_SNP_", "", X)))]
# Order the new columns so B is never greater than E
DT[, c("B", "E") := list(pmin(B, E), pmax(B, E))]

# Keep only the first instance of each (B, E) pair, i.e. drop duplicates
DT <- DT[, .SD[1], by = c("B", "E")]
# Delete the extra columns
DT[, c("B", "E") := NULL]
DT

Result:
          X Y
1:  1_SNP_3 4
2:  2_SNP_6 3
3: 20_SNP_7 7

Upvotes: 2
