Nancy

Reputation: 101

Subset a dataframe with specific condition in R

Hello, I have this data frame:

       res1 res4 aa1234
    1     1    4   IVGG
    2    10   13   RQFP
    3   102  105   TSSV
    4   112  115   LQNA
    5   118  121   EAGT
    6    12   15   FPFL
    7   132  135   RSGG
    8   138  141   SRFP
    9   150  153   PEDQ
    10  151  154   EDQC
    11  155  158   RPNN
    12  165  168   TRRG
    13  171  174   CNGD
    14  172  175   NGDG
    15  174  177   DGGT
    16  181  184   CEGL
    17  195  198   PCGR
    18   20   23   NQGR
    19  205  208   RVAL
    20   32   35   HARF
    21   39   42   AASC
    22   40   43   ASCF
    23   48   51   PGVS
    24   57   60   AYDL
    25   59   62   DLRR
    26   64   67   ERQS
    27   65   68   RQSR
    28   78   81   ENGY
    29    8   11   RPRQ
    30   82   85   DPQQ
    31   83   86   PQQN
    32   86   89   NLND
    33   95   98   LDRE

I want to subset it, keeping only the rows whose res1 lies within 4 of another row's res1 (for a row with res1 = i, the next res1 must be <= i + 4), like this:

   res1 res4 aa1234
29    8   11   RPRQ
 6   12   15   FPFL
21   39   42   AASC
22   40   43   ASCF
24   57   60   AYDL
25   59   62   DLRR
26   64   67   ERQS
27   65   68   RQSR
28   78   81   ENGY
30   82   85   DPQQ
31   83   86   PQQN
32   86   89   NLND
9   150  153   PEDQ
10  151  154   EDQC
11  155  158   RPNN
13  171  174   CNGD
14  172  175   NGDG
15  174  177   DGGT

I tried the functions "filter" and "subset", but I didn't get the expected result.

So in general, I need the rows that overlap with, or directly follow, another row: for a row starting at i, the next row must start within the range i to i+4, with i+4 included.

For example, in these three rows, rows [9] and [10] overlap (150-153 overlaps with 151-154), and row [11] starts at res1[10] + 4 (151 + 4 = 155). So one idea would be to take res1[i] and check whether res1[i+1] <= res1[i] + 4.

9   150  153   PEDQ
10  151  154   EDQC
11  155  158   RPNN

Upvotes: 0

Views: 70

Answers (1)

AnilGoyal

Reputation: 26238

Why not simply do this?

df[df$res1 %in% c(df$res1 - 4, df$res1 - 3, df$res1 - 2, df$res1 - 1, df$res1 + 1, df$res1 + 2, df$res1 + 3, df$res1 + 4), ]

   res1 res4 aa1234
2    10   13   RQFP
6    12   15   FPFL
9   150  153   PEDQ
10  151  154   EDQC
11  155  158   RPNN
13  171  174   CNGD
14  172  175   NGDG
15  174  177   DGGT
21   39   42   AASC
22   40   43   ASCF
24   57   60   AYDL
25   59   62   DLRR
26   64   67   ERQS
27   65   68   RQSR
28   78   81   ENGY
29    8   11   RPRQ
30   82   85   DPQQ
31   83   86   PQQN
32   86   89   NLND
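The same membership test can be written more compactly with `outer`. This is only a sketch: the names `offsets`, `neighbours` and `keep` are mine, the data frame is rebuilt from the question's res1 values without the aa1234 column, and it assumes the res1 values are unique, as they are here.

```r
# res1 values from the question, in their original row order
res1 <- c(1, 10, 102, 112, 118, 12, 132, 138, 150, 151, 155, 165, 171, 172,
          174, 181, 195, 20, 205, 32, 39, 40, 48, 57, 59, 64, 65, 78, 8, 82,
          83, 86, 95)
# res4 equals res1 + 3 in every row of the question's data; aa1234 is left out
df <- data.frame(res1 = res1, res4 = res1 + 3)

# Every non-zero shift of up to 4 positions in either direction
offsets <- c(-4:-1, 1:4)

# Matrix of all shifted values: one row per res1, one column per offset
neighbours <- outer(df$res1, offsets, "+")

# A row is kept when its res1 equals some other row's res1 shifted by 1 to 4
keep <- df$res1 %in% neighbours
result <- df[keep, ]
print(result)
```

Because offset 0 is excluded, a value that only matches itself is never kept. The rows come back in their original order, so sort by res1 afterwards if needed.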

Edit: for the edited scenario, just order the df first; the rest stays the same:

df <- df[order(df$res1),]
df[sort(unique(c(which(rev(diff(rev(df$res1))) >= -3 & rev(diff(rev(df$res1))) <= 0), which(diff(df$res1) <= 4 & diff(df$res1) >= 0)+1))),]

   res1 res4 aa1234
29    8   11   RPRQ
2    10   13   RQFP
6    12   15   FPFL
21   39   42   AASC
22   40   43   ASCF
24   57   60   AYDL
25   59   62   DLRR
26   64   67   ERQS
27   65   68   RQSR
30   82   85   DPQQ
31   83   86   PQQN
32   86   89   NLND
9   150  153   PEDQ
10  151  154   EDQC
11  155  158   RPNN
13  171  174   CNGD
14  172  175   NGDG
15  174  177   DGGT
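Note that the expression above keeps a row because of its gap to the next row only when that gap is at most 3 (the `>= -3` bound), so row 28 (res1 = 78, followed by 82) is missing from this output although it appears in the `%in%` result. If a gap of exactly 4 should count in both directions, a symmetric version might look like the sketch below. The names `max_gap`, `close_to_next` and `close_to_prev` are mine, and the data frame is rebuilt from the question's res1 values without the aa1234 column.

```r
# res1 values from the question, in their original row order
res1 <- c(1, 10, 102, 112, 118, 12, 132, 138, 150, 151, 155, 165, 171, 172,
          174, 181, 195, 20, 205, 32, 39, 40, 48, 57, 59, 64, 65, 78, 8, 82,
          83, 86, 95)
# res4 equals res1 + 3 in every row of the question's data; aa1234 is left out
df <- data.frame(res1 = res1, res4 = res1 + 3)

# Sort first so that each row's nearest neighbours are adjacent
df <- df[order(df$res1), ]

max_gap <- 4              # largest allowed distance, inclusive
gap <- diff(df$res1)      # gap[i] = res1[i + 1] - res1[i], never negative after sorting

# Row i is close to the row after it / before it; pad the ends with FALSE
close_to_next <- c(gap <= max_gap, FALSE)
close_to_prev <- c(FALSE, gap <= max_gap)

result <- df[close_to_next | close_to_prev, ]
print(result)
```

With this data the result has the same 19 rows as the `%in%` approach, sorted by res1.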

Old answer: use this

df[sort(unique(c(which(rev(diff(rev(df$res1))) >= -3 & rev(diff(rev(df$res1))) <= 0), which(diff(df$res1) <= 4 & diff(df$res1) >= 0)+1))),] 


   res1 res4 aa1234
9   150  153   PEDQ
10  151  154   EDQC
11  155  158   RPNN
13  171  174   CNGD
14  172  175   NGDG
15  174  177   DGGT
21   39   42   AASC
22   40   43   ASCF
24   57   60   AYDL
25   59   62   DLRR
26   64   67   ERQS
27   65   68   RQSR
30   82   85   DPQQ
31   83   86   PQQN
32   86   89   NLND

Data used

df <- read.table(text = "res1 res4 aa1234 
1 1 4 IVGG 
2 10 13 RQFP 
3 102 105 TSSV 
4 112 115 LQNA 
5 118 121 EAGT 
6 12 15 FPFL 
7 132 135 RSGG 
8 138 141 SRFP 
9 150 153 PEDQ 
10 151 154 EDQC 
11 155 158 RPNN 
12 165 168 TRRG 
13 171 174 CNGD 
14 172 175 NGDG 
15 174 177 DGGT 
16 181 184 CEGL 
17 195 198 PCGR 
18 20 23 NQGR 
19 205 208 RVAL 
20 32 35 HARF 
21 39 42 AASC 
22 40 43 ASCF 
23 48 51 PGVS 
24 57 60 AYDL 
25 59 62 DLRR 
26 64 67 ERQS 
27 65 68 RQSR 
28 78 81 ENGY 
29 8 11 RPRQ 
30 82 85 DPQQ 
31 83 86 PQQN 
32 86 89 NLND 
33 95 98 LDRE", header = TRUE)

Upvotes: 1
