Reputation: 21
This question should be very simple, but I don't exactly know how to do it in R. Basically I have a two-column file, in which, the first column represents scaffold IDs and the second column is the position of an SNP in that particular scaffold.
id POS
0001 38
0001 46
0001 50
0002 17
0002 23
0002 46
0003 13
0003 19
0003 38
... ...
And I would like to create a new column in which, the SNP position (NEW_POS) for the first scaffold is the same, but for the second scaffold (and thereafter) the NEW_POS would be the result of adding the last POS value of the previous scaffold to each POS in the second scaffold (i.e. 50+17, 50+23, 50+46,...). For the third scaffold (96+13, 96+19, 96+38, ...) and so on. As you can see here:
id POS NEW_POS
0001 38 38
0001 46 46
0001 50 50
0002 17 67
0002 23 73
0002 46 96
0003 13 109
0003 19 115
0003 38 134
... ... ...
Upvotes: 1
Views: 73
Reputation: 42544
Here is a data.table
solution which uses also shift()
and cumsum()
but updates on join
library(data.table)
DT[DT[, last(POS), id][, .(id, shift(cumsum(V1), fill = 0))], on = "id",
NEW_POS := POS + V2][]
which returns
id POS NEW_POS
1: 0001 38 38
2: 0001 46 46
3: 0001 50 50
4: 0002 17 67
5: 0002 23 73
6: 0002 46 96
7: 0003 13 109
8: 0003 19 115
9: 0003 38 134
tmp <- DT[, last(POS), id][, .(id, shift(cumsum(V1), fill = 0))][]
tmp
# id V2
#1: 0001 0
#2: 0002 50
#3: 0003 96
picks the last value of each id
group, shifts (lags) it by one, and computes the cumulative sum.
Then this result is right joined on id
with the full data.table
DT[tmp, on = "id", NEW_POS := POS + V2][]
thereby creating NEW_POS
in place.
DT <- structure(list(id = c("0001", "0001", "0001", "0002", "0002",
"0002", "0003", "0003", "0003"), POS = c(38L, 46L, 50L, 17L,
23L, 46L, 13L, 19L, 38L)), .Names = c("id", "POS"), row.names = c(NA,
-9L), class = "data.frame")
#coerce to data.table
setDT(DT)
Upvotes: 0
Reputation: 17289
Here is a solution using lag
and cumsum
:
library(dplyr)
df1 %>%
mutate(
scf.len = (id != lag(id, default = id[1])) * lag(POS, default = 0),
New_POS = cumsum(scf.len) + POS
) %>%
select(-scf.len)
# id POS New_POS
# 1 0001 38 38
# 2 0001 46 46
# 3 0001 50 50
# 4 0002 17 67
# 5 0002 23 73
# 6 0002 46 96
# 7 0003 13 109
# 8 0003 19 115
# 9 0003 38 134
The data:
> dput(df1)
structure(list(id = c("0001", "0001", "0001", "0002", "0002",
"0002", "0003", "0003", "0003"), POS = c(38, 46, 50, 17, 23,
46, 13, 19, 38)), .Names = c("id", "POS"), class = "data.frame", row.names = c(NA,
-9L))
Upvotes: 5