Reputation: 157
I wasn't able to find an answer anywhere; I probably didn't use the right search terms or wasn't able to transfer similar problems to mine.
So I hope that someone here is able to help me.
I have a data.table dt1 in the following form (I tried to keep it short, but needed to include all possibilities):
ID session
101 1
101 1
101 2
101 4
102 2
102 4
102 5
103 1
103 4
201 1
201 4
201 5
202 1
202 2
203 1
204 5
Code to reproduce this:
dt1 <- data.table(ID=c(101, 101, 101, 101, 102, 102, 102, 103, 103, 201, 201, 201, 202, 202, 203, 204), session=c(1, 1, 2, 4, 2, 4, 5, 1, 4, 1, 4, 5, 1, 2, 1, 5))
What I want in a first step is a data.table with a 1 for each session that has an entry in the input data.table and a 0 otherwise.
ID 1 2 3 4 5
101 1 1 0 1 0
102 0 1 0 1 1
103 1 0 0 1 0
201 1 0 0 1 1
202 1 1 0 0 0
203 1 0 0 0 0
204 0 0 0 0 1
Right now, I am generating two lists,
IDs <- sort(unique(dt1$ID))
sessions <- unique(dt1$session)
an empty data.table dt2 with ncol=length(sessions) and nrow=length(IDs), with the sessions as column names
dt2 <- data.table(matrix(ncol=length(sessions), nrow=length(IDs)))
colnames(dt2) <- as.character(unique(dt1$session))
and a list with sessions per ID.
sesID <- split(dt1$session, dt1$ID)
Then I run through the lists with two for loops.
for (i in 1:nrow(dt2)) {
  for (j in 1:length(dt2)) {
    # sesID[[i]] extracts the session vector; sesID[i] would be a one-element list
    if (sessions[j] %in% sesID[[i]]) {
      set(dt2, i, j, 1)
    } else {
      set(dt2, i, j, 0)
    }
  }
}
As a second step, I want to change all the 0s into 1s if the session lies between sessions with 1s.
ID 1 2 3 4 5
101 1 1 1 1 0
102 0 1 1 1 1
103 1 1 1 1 0
201 1 0 0 1 1
202 1 1 0 0 0
203 1 0 0 0 0
204 0 0 0 0 1
I am doing this with another two for loops.
for (i in 1:nrow(dt2)) {
trues <- which(dt2[i,]==1)
headTrues <- head(trues, 1)
tailTrues <- tail(trues, 1)
for (j in 1:length(dt2)){
if (j > headTrues & j < tailTrues & headTrues <= tailTrues){
set(dt2, i, j, 1)
} } }
As this gives me a data.table dt3 with TRUEs and FALSEs, I replace them afterwards.
(to.replace <- names(which(sapply(dt3, is.logical))))
for (var in to.replace) dt3[, (var) := as.numeric(get(var))]
To keep the IDs as a column, I add them afterwards.
dt3$ID <- IDs
This would be okay if I didn't have around 12000 unique IDs and didn't need to do a couple of thousand runs. I am pretty sure that there are much better ways to do this in R. I just don't know them yet.
Thank you very much in advance.
Upvotes: 3
Views: 160
Reputation: 83275
Using:
# create a reference data.table which includes also 'session 3'
ref <- CJ(ID = dt1$ID, session = min(dt1$session):max(dt1$session), unique = TRUE)
# join 'ref' with 'dt1' and create a new variable that has NA's
# for values that don't exist in 'dt1$session'
ref[dt1, on = c('ID','session'), ses2 := i.session]
# summarise to create a dummy and reshape to wide format with the 'dcast'-function
dcast(ref[, sum(!is.na(ses2)), .(ID,session)],
ID ~ session, value.var = 'V1')
you get:
ID 1 2 3 4 5
1: 101 1 1 0 1 0
2: 102 0 1 0 1 1
3: 103 1 0 0 1 0
4: 201 1 0 0 1 1
5: 202 1 1 0 0 0
6: 203 1 0 0 0 0
7: 204 0 0 0 0 1
An alternative (as proposed by @Frank in the comments):
dt1[, session := factor(session, levels=1:5)]
dcast(dt1, ID ~ session, fun = function(x) sign(length(x)), drop = FALSE)
which will give you the same result.
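Note that `:=` changes `dt1` by reference, so after the line above `session` is a factor and later code that calls `min(session)` or `max(session)` on `dt1` will no longer work. A minimal sketch of the same idea on a copy, assuming the sessions of interest are 1 to 5 (the names `tmp` and `wide` are only illustrative):

```r
library(data.table)

dt1 <- data.table(ID = c(101, 101, 101, 101, 102, 102, 102, 103, 103,
                         201, 201, 201, 202, 202, 203, 204),
                  session = c(1, 1, 2, 4, 2, 4, 5, 1, 4, 1, 4, 5, 1, 2, 1, 5))

# Work on a copy so the numeric 'session' column in dt1 stays untouched
tmp <- copy(dt1)[, session := factor(session, levels = 1:5)]

# drop = FALSE keeps the unused factor level, i.e. the column for session 3
wide <- dcast(tmp, ID ~ session,
              fun.aggregate = function(x) sign(length(x)),
              value.var = "session", drop = FALSE)
wide
```

Passing `value.var` explicitly avoids relying on dcast guessing which column to aggregate.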
If you want to fill the zeros between 1s, you could use the shift-function to check whether the preceding and the next value are both equal to 1. The check is done inside j so that it is evaluated separately for each ID:
dcast(ref[, sum(!is.na(ses2)), .(ID,session)
][, V1 := as.integer(V1 == 1 | (shift(V1,1,0,'lag')==1 & shift(V1,1,0,'lead')==1)), ID],
ID ~ session, value.var = 'V1')
you will then get (only gaps of a single session are filled):
ID 1 2 3 4 5
1: 101 1 1 1 1 0
2: 102 0 1 1 1 1
3: 103 1 0 0 1 0
4: 201 1 0 0 1 1
5: 202 1 1 0 0 0
6: 203 1 0 0 0 0
7: 204 0 0 0 0 1
In response to your comment, to replace all zeros between 1s you can use a combination of the rle and inverse.rle functions:
dt2 <- unique(dt1)[, val := 1
][CJ(ID = ID, session = min(session):max(session), unique = TRUE), on = c('ID','session')
][is.na(val), val := 0
][, val := {rl <- rle(val);
rl$values[rl$values==0 & shift(rl$values,fill=0)==1 & shift(rl$values,fill=0,type='lead')==1] <- 1;
inverse.rle(rl)},
ID]
dcast(dt2, ID ~ session, value.var = 'val')
This gives:
ID 1 2 3 4 5
1: 101 1 1 1 1 0
2: 102 0 1 1 1 1
3: 103 1 1 1 1 0
4: 201 1 1 1 1 1
5: 202 1 1 0 0 0
6: 203 1 0 0 0 0
7: 204 0 0 0 0 1
Alternately (@Frank's idea):
ref[, v := 0L]
ref[dt1[, .(first(session), last(session)), by=ID], on=.(ID, session >= V1, session <= V2),
v := 1L ]
dcast(ref, ID ~ session, value.var = 'v')
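The non-equi join marks every session that lies between an ID's first and last session. As a cross-check, the same idea can be sketched in base R with `outer`, under the assumption that all sessions from the smallest to the largest observed one should appear as columns (the names `rng`, `all_sessions` and `filled` are only illustrative):

```r
dt1 <- data.frame(ID = c(101, 101, 101, 101, 102, 102, 102, 103, 103,
                         201, 201, 201, 202, 202, 203, 204),
                  session = c(1, 1, 2, 4, 2, 4, 5, 1, 4, 1, 4, 5, 1, 2, 1, 5))

# 2 x n matrix: row 1 holds the first session per ID, row 2 the last one
rng <- vapply(split(dt1$session, dt1$ID), range, numeric(2))
all_sessions <- min(dt1$session):max(dt1$session)

# A cell is 1 when the session lies within [first, last] for that ID
filled <- (outer(rng[1, ], all_sessions, "<=") &
           outer(rng[2, ], all_sessions, ">=")) * 1L
dimnames(filled) <- list(ID = colnames(rng),
                         session = as.character(all_sessions))
filled
```

This returns a plain integer matrix rather than a data.table, which may be enough when only the 0/1 pattern is needed.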
When all different session numbers are present in the dataset, you can also use a nested dcast/melt-approach as an alternative to one with a cross-join (with regard to speed and memory efficiency, the previous approach with a cross-join (CJ) is preferable).
New example dataset:
DT <- data.table(ID=c(101, 101, 101, 101, 102, 102, 102, 103, 103, 201, 201, 201, 202, 202, 203, 204),
session=c(1, 2, 3, 4, 2, 4, 5, 1, 4, 1, 4, 5, 1, 2, 1, 5))
The code:
dcast(melt(dcast(DT[, val := 1],
ID ~ session,
value.var = 'val',
fill = 0),
id = 'ID')[, value := {rl <- rle(value);
rl[[2]][rl[[2]]==0 & shift(rl[[2]],1,0)==1 & shift(rl[[2]],1,0,'lead')==1] <- 1;
inverse.rle(rl)},
ID],
ID ~ variable, value.var = 'value')
This gives:
ID 1 2 3 4 5
1: 101 1 1 1 1 0
2: 102 0 1 1 1 1
3: 103 1 1 1 1 0
4: 201 1 1 1 1 1
5: 202 1 1 0 0 0
6: 203 1 0 0 0 0
7: 204 0 0 0 0 1
Upvotes: 4
Reputation: 2313
One way is to use reshape.
First create a column value equal to 1:
dt1[, value := 1]
And now reshape it to wide format:
dt1.1 <- reshape(dt1, direction = "wide", idvar = "ID", timevar = "session")
You will get this:
ID value.1 value.2 value.4 value.5
1: 101 1 1 1 NA
2: 102 NA 1 1 1
3: 103 1 NA 1 NA
4: 201 1 NA 1 1
5: 202 1 1 NA NA
6: 203 1 NA NA NA
7: 204 NA NA NA 1
Replace NA with 0:
dt1.1[is.na(dt1.1)] <- 0
ID value.1 value.2 value.4 value.5
1: 101 1 1 1 0
2: 102 0 1 1 1
3: 103 1 0 1 0
4: 201 1 0 1 1
5: 202 1 1 0 0
6: 203 1 0 0 0
7: 204 0 0 0 1
Alternatively with dcast:
dcast(ID ~ session, data = dt1, fun.aggregate = function(x) as.numeric(length(x) > 0))
ID 1 2 4 5
1 101 1 1 1 0
2 102 0 1 1 1
3 103 1 0 1 0
4 201 1 0 1 1
5 202 1 1 0 0
6 203 1 0 0 0
7 204 0 0 0 1
Upvotes: 2
Reputation: 1421
You can do the first step in this way... Is that what you are looking for?
library(dplyr)
# df_dt1 is assumed to be the question's dt1 as a data frame
df_dt1 <- as.data.frame(dt1)
df_dt1 %>% group_by(ID) %>% summarize(S1 = as.integer(sum(session == 1) > 0),
S2 = as.integer(sum(session == 2) > 0),
S3 = as.integer(sum(session == 3) > 0),
S4 = as.integer(sum(session == 4) > 0),
S5 = as.integer(sum(session == 5) > 0))
you get
ID S1 S2 S3 S4 S5
<dbl> <int> <int> <int> <int> <int>
1 101 1 1 0 1 0
2 102 0 1 0 1 1
3 103 1 0 0 1 0
4 201 1 0 0 1 1
5 202 1 1 0 0 0
6 203 1 0 0 0 0
7 204 0 0 0 0 1
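The summarize call above needs one hand-written column per session. A possible sketch that avoids the hard-coding uses base R `table` instead of dplyr, assuming every session from the smallest to the largest observed one should get a column (the names `all_sessions`, `counts` and `present` are only illustrative):

```r
dt1 <- data.frame(ID = c(101, 101, 101, 101, 102, 102, 102, 103, 103,
                         201, 201, 201, 202, 202, 203, 204),
                  session = c(1, 1, 2, 4, 2, 4, 5, 1, 4, 1, 4, 5, 1, 2, 1, 5))

all_sessions <- min(dt1$session):max(dt1$session)

# Count entries per ID and session; the factor levels keep the empty session 3
counts <- table(ID = dt1$ID,
                session = factor(dt1$session, levels = all_sessions))

# Turn counts into 0/1 indicators
present <- (counts > 0) * 1L
present
```

The result is a table object indexed by ID and session, for example `present["101", "3"]`.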
Upvotes: 0