How to add the last value of an ID into the values of the next ID

This question should be very simple, but I don't exactly know how to do it in R. Basically I have a two-column file, in which, the first column represents scaffold IDs and the second column is the position of an SNP in that particular scaffold.

id    POS
0001  38
0001  46
0001  50
0002  17
0002  23
0002  46
0003  13
0003  19
0003  38
...   ...

And I would like to create a new column in which, the SNP position (NEW_POS) for the first scaffold is the same, but for the second scaffold (and thereafter) the NEW_POS would be the result of adding the last POS value of the previous scaffold to each POS in the second scaffold (i.e. 50+17, 50+23, 50+46,...). For the third scaffold (96+13, 96+19, 96+38, ...) and so on. As you can see here:

id    POS  NEW_POS
0001  38   38
0001  46   46
0001  50   50
0002  17   67
0002  23   73
0002  46   96
0003  13   109
0003  19   115
0003  38   134
...   ...  ...

Upvotes: 1

Views: 73

Answers (2)

Uwe
Uwe

Reputation: 42544

Here is a data.table solution which uses also shift() and cumsum() but updates on join

library(data.table)
DT[DT[, last(POS), id][, .(id, shift(cumsum(V1), fill = 0))], on = "id", 
   NEW_POS := POS + V2][]

which returns

     id POS NEW_POS
1: 0001  38      38
2: 0001  46      46
3: 0001  50      50
4: 0002  17      67
5: 0002  23      73
6: 0002  46      96
7: 0003  13     109
8: 0003  19     115
9: 0003  38     134

Explanation

tmp <- DT[, last(POS), id][, .(id, shift(cumsum(V1), fill = 0))][]
tmp
#     id V2
#1: 0001  0
#2: 0002 50
#3: 0003 96

picks the last value of each id group, shifts (lags) it by one, and computes the cumulative sum.

Then this result is right joined on id with the full data.table

DT[tmp, on = "id", NEW_POS := POS + V2][]

thereby creating NEW_POS in place.

Data

DT <- structure(list(id = c("0001", "0001", "0001", "0002", "0002", 
"0002", "0003", "0003", "0003"), POS = c(38L, 46L, 50L, 17L, 
23L, 46L, 13L, 19L, 38L)), .Names = c("id", "POS"), row.names = c(NA, 
-9L), class = "data.frame")
#coerce to data.table
setDT(DT)

Upvotes: 0

mt1022
mt1022

Reputation: 17289

Here is a solution using lag and cumsum:

library(dplyr)

df1 %>%
    mutate(
        scf.len = (id != lag(id, default = id[1])) * lag(POS, default = 0),
        New_POS = cumsum(scf.len) + POS
    )  %>%
    select(-scf.len)
#     id POS New_POS
# 1 0001  38      38
# 2 0001  46      46
# 3 0001  50      50
# 4 0002  17      67
# 5 0002  23      73
# 6 0002  46      96
# 7 0003  13     109
# 8 0003  19     115
# 9 0003  38     134

The data:

> dput(df1)
structure(list(id = c("0001", "0001", "0001", "0002", "0002", 
"0002", "0003", "0003", "0003"), POS = c(38, 46, 50, 17, 23, 
46, 13, 19, 38)), .Names = c("id", "POS"), class = "data.frame", row.names = c(NA, 
-9L))

Upvotes: 5

Related Questions