Find and remove rows that are identical in 3 columns and differ in 1

I have binned my data in intervals (of 100000) using 2 different frames: from 0 to 100000 and onwards, and from 50000 to 150000 and onwards. I then joined both dataframes, using one column as identifier for the frames (represented in column "x100kb").

For my purpose, if 2 rows (edit: they don't need to be sequent to each other; since the data is not ordered by "chr" and "x100kb" right now) differ in "x100kb" by 0.5 (preferably comparing whole numbers to their +0.5; eg: 60 to 60.5, 65 to 65.5; etc) but they have the same values in "chr" and "occurrences_norm" and "occurrences_tum"; then they are equal and I want to remove one of them. The only thing coming to mind now are loops, which obviusly is not very productive...

Data example:

       chr    x100Kb occurrences_norm    occurrences_tum   fold
19064 chr17   61.5               17               0 14.05333
38799  chr5  526.0               16               0 13.96587
38800  chr5  526.5               16               0 13.96587
39946  chr5 1113.5               16               0 13.96587
2377   chr1 1426.0               15               0 13.87277
21859 chr18  733.5               15               0 13.87277
20538 chr18   24.0               14               0 13.77324
21863 chr18  735.5               14               0 13.77324
37699  chr4 1835.5               14               0 13.77324
39924  chr5 1102.5               14               0 13.77324
21506 chr18  550.5               13               0 13.66633
21862 chr18  735.0               13               0 13.66633
22258 chr19  151.5               13               0 13.66633
38972  chr5  613.0               13               0 13.66633
41707  chr6  194.5               13               0 13.66633
2380   chr1 1427.5               12               0 13.55087
20541 chr18   25.5               12               0 13.55087
21252 chr18  421.0               12               0 13.55087
27384  chr2 2243.0               12               0 13.55087
39990  chr5 1135.5               12               0 13.55087

In the example, the 3rd row would be removed.

Upvotes: 1

Views: 77

Answers (3)

jazzurro
jazzurro

Reputation: 23574

I read the question in a different way. I thought we need to compare any two sequent rows. For example, check row 1 & 2, row 2 & 3, and so on. I also thought that the condition is the difference in x100Kb is 0.5, not large than 0.5. I thought running four logical checks, using shift(), would be one way to achieve the goal.

setDT(df1)[!((abs(x100Kb - shift(x100Kb, type = "lag", fill = -Inf)) == 0.5) &
             (chr == shift(chr, type = "lag")) &
             (occurrences_norm == shift(occurrences_norm, type = "lag")) &
             (occurrences_tum == shift(occurrences_tum, type = "lag")))
           ]

#      chr x100Kb occurrences_norm occurrences_tum     fold
# 1: chr17   61.5               17               0 14.05333
# 2:  chr5  526.0               16               0 13.96587
# 3:  chr5 1113.5               16               0 13.96587
# 4:  chr1 1426.0               15               0 13.87277
# 5: chr18  733.5               15               0 13.87277
# 6: chr18   24.0               14               0 13.77324
# 7: chr18  735.5               14               0 13.77324
# 8:  chr4 1835.5               14               0 13.77324
# 9:  chr5 1102.5               14               0 13.77324
#10: chr18  550.5               13               0 13.66633
#11: chr18  735.0               13               0 13.66633
#12: chr19  151.5               13               0 13.66633
#13:  chr5  613.0               13               0 13.66633
#14:  chr6  194.5               13               0 13.66633
#15:  chr1 1427.5               12               0 13.55087
#16: chr18   25.5               12               0 13.55087
#17: chr18  421.0               12               0 13.55087
#18:  chr2 2243.0               12               0 13.55087
#19:  chr5 1135.5               12               0 13.55087

Upvotes: 2

akrun
akrun

Reputation: 887078

We could also the data.table

library(data.table)
setDT(df1)[df1[,  .I[abs(x100Kb - shift(x100Kb, fill = -Inf)) > 0.5]  , 
                  by =  .(chr, occurrences_norm, occurrences_tum)]$V1]
#      chr x100Kb occurrences_norm occurrences_tum     fold
# 1: chr17   61.5               17               0 14.05333
# 2:  chr5  526.0               16               0 13.96587
# 3:  chr5 1113.5               16               0 13.96587
# 4:  chr1 1426.0               15               0 13.87277
# 5: chr18  733.5               15               0 13.87277
# 6: chr18   24.0               14               0 13.77324
# 7: chr18  735.5               14               0 13.77324
# 8:  chr4 1835.5               14               0 13.77324
# 9:  chr5 1102.5               14               0 13.77324
#10: chr18  550.5               13               0 13.66633
#11: chr18  735.0               13               0 13.66633
#12: chr19  151.5               13               0 13.66633
#13:  chr5  613.0               13               0 13.66633
#14:  chr6  194.5               13               0 13.66633
#15:  chr1 1427.5               12               0 13.55087
#16: chr18   25.5               12               0 13.55087
#17: chr18  421.0               12               0 13.55087
#18:  chr2 2243.0               12               0 13.55087
#19:  chr5 1135.5               12               0 13.55087

Upvotes: 1

user2100721
user2100721

Reputation: 3587

Try this using dplyr package

library(dplyr)
df1 %>% group_by(chr,occurrences_norm,occurrences_tum) %>% 
mutate(tmp=diff(c(0,x100Kb))) %>% filter(tmp>0.5) %>% select(-tmp)

# chr x100Kb occurrences_norm occurrences_tum     fold
# (fctr)  (dbl)            (int)           (int)    (dbl)
# 1   chr17   61.5               17               0 14.05333
# 2    chr5  526.0               16               0 13.96587
# 3    chr5 1113.5               16               0 13.96587
# 4    chr1 1426.0               15               0 13.87277
# 5   chr18  733.5               15               0 13.87277
# 6   chr18   24.0               14               0 13.77324
# 7   chr18  735.5               14               0 13.77324
# 8    chr4 1835.5               14               0 13.77324
# 9    chr5 1102.5               14               0 13.77324
# 10  chr18  550.5               13               0 13.66633
# 11  chr18  735.0               13               0 13.66633
# 12  chr19  151.5               13               0 13.66633
# 13   chr5  613.0               13               0 13.66633
# 14   chr6  194.5               13               0 13.66633
# 15   chr1 1427.5               12               0 13.55087
# 16  chr18   25.5               12               0 13.55087
# 17  chr18  421.0               12               0 13.55087
# 18   chr2 2243.0               12               0 13.55087
# 19   chr5 1135.5               12               0 13.55087

Upvotes: 0

Related Questions