Reputation: 77
I need to create a complicated "for" loop, but after reading some examples I'm still clueless of how to write it in a proper R way and therefore I'm not sure whether it will work or not. I'm still an R beginner :(
I have a dataset in the long format, with different occasions, however, some occasions are not truly new ones since the date of start is the same, but have a different offence that I need to copy in a new column called "offence2", after this I need to drop the false new occasion, in order to keep only rows that represent new occasions. My real data have up to 8 different offences for a single date, but I made a simpler example.
This are an example of how my data looks like
id<-c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5)
dstart<-c("25/11/2006", "13/12/2006","13/12/2006","07/02/2006","07/02/2006",
"15/01/2006", "22/03/2006","18/09/2006", "04/03/2006","04/03/2006",
"22/08/2006","22/08/2006","11/04/2006", "11/04/2006", "19/10/2006")
dstart1<-as.Date(dstart, "%d/%m/%Y")
offence<-c("a","b","c","b","d","a","a","e","b","a","c","a","a","b","a")
cod_offence<-c(25, 26,27,26,28,25,25,29,26,25,27,25,25,26,25)
mydata<-data.frame(id, dstart1, offence, cod_offence)
Data
id dstart1 offence cod_offence
1 1 2006-11-25 a 25
2 1 2006-12-13 b 26
3 1 2006-12-13 c 27
4 2 2006-02-07 b 26
5 2 2006-02-07 d 28
6 3 2006-01-15 a 25
7 3 2006-03-22 a 25
8 3 2006-09-18 e 29
9 4 2006-03-04 b 26
10 4 2006-03-04 a 25
11 4 2006-08-22 c 27
12 4 2006-08-22 a 25
13 5 2006-04-11 a 25
14 5 2006-04-11 b 26
15 5 2006-10-19 a 25
I need something like this:
id dstart1 offence cod_offence offence2
1 1 2006-11-25 a 25 NA
2 1 2006-12-13 b 26 c
3 1 2006-12-13 c 27 NA
4 2 2006-02-07 b 26 d
5 2 2006-02-07 d 28 NA
6 3 2006-01-15 a 25 NA
7 3 2006-03-22 a 25 NA
8 3 2006-09-18 e 29 NA
9 4 2006-03-04 b 26 a
10 4 2006-03-04 a 25 NA
11 4 2006-08-22 c 27 a
12 4 2006-08-22 a 25 NA
13 5 2006-04-11 a 25 b
14 5 2006-04-11 b 26 NA
15 5 2006-10-19 a 25 NA
I think that I need to do something like this: given i=individual j=observation within individual
for each individual I need to check whether mydata$dstart1(j) = mydata$dstart1(j+1)
if this is true, then copy mydata$offence2(j)=mydata$offence(j+1), otherwise keep the same value
This has to stop if id(j) != id(j+1) and re-start with the new id.
My problem is that I don't know how to put this in a loop.
Thank you!!
Update
Yes, it'w works fine with the example, but not yet with my real data, since they are a little bit more complex What happen if instead of two repeated dates I have three or more? each one of them with different offences. Following @CathG solution, I need to create more variables according to the number of offences (in my case 8), I guess I would need a new vector that identify the position of the observation within id and a new "instruction" that tell R that depending of the position of the mydata$dstart1, the value need to be copied in a different column. But then again, I don't know how to do it.
id dstart1 offence cod_offence offence2 offence3 offence4
1 1 2006-11-25 a 25 NA NA NA
2 1 2006-12-13 b 26 c NA NA
3 1 2006-12-13 c 27 NA NA NA
4 2 2006-02-07 b 26 d NA NA
5 2 2006-02-07 d 28 NA NA NA
6 2 2006-04-12 b 26 d c a
7 2 2006-04-12 d 28 NA NA NA
8 2 2006-04-12 c 27 NA NA NA
9 2 2006-04-12 a 25 NA NA NA
Thanks again!!!
Upvotes: 1
Views: 7315
Reputation: 24074
With split
and a loop :
# data with repeated dates /offences
id<-c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,5)
dstart<-c("25/11/2006", "13/12/2006","13/12/2006","07/02/2006","07/02/2006",
"15/01/2006", "22/03/2006","18/09/2006", "04/03/2006","04/03/2006",
"22/08/2006","22/08/2006","11/04/2006", "11/04/2006", "19/10/2006","19/10/2006","19/10/2006","19/10/2006")
dstart1<-as.Date(dstart, "%d/%m/%Y")
offence<-c("a","b","c","b","d","a","a","e","b","a","c","a","a","b","a","c","b","a")
cod_offence<-c(25, 26,27,26,28,25,25,29,26,25,27,25,25,26,25,27,25,25)
mydata<-data.frame(id, dstart1, offence, cod_offence)
# see the max offences there are for same id and date
maxoff<-max(table(mydata$id,mydata$dstart1))
mydata[,paste("offence",2:maxoff,sep="")]<-NA
# split your data according to id
splitmydata<-split(mydata,mydata$id)
# for each "per id dataset", apply a function that looks for repeated offences / dates and fill the "offences" variables in the row with first occurence of specific date.
splitmydata2<-lapply(splitmydata,
function(tab){
for(datestart in unique(tab[,"dstart1"])){
ind_date<-sort(which(tab[,"dstart1"]==datestart))
if(length(ind_date[-1])){
tab[ind_date[1],grep("^offence",colnames(tab),value=T)[2:(length(ind_date))]]<-as.character(tab[ind_date[-1],"offence"])
}
}
return(tab)
}
)
mydata2<-unsplit(splitmydata2,mydata$id) # finally, unsplit your data
> mydata2
id dstart1 offence cod_offence offence2 offence3 offence4
1 1 2006-11-25 a 25 <NA> <NA> <NA>
2 1 2006-12-13 b 26 c <NA> <NA>
3 1 2006-12-13 c 27 <NA> <NA> <NA>
4 2 2006-02-07 b 26 d <NA> <NA>
5 2 2006-02-07 d 28 <NA> <NA> <NA>
6 3 2006-01-15 a 25 <NA> <NA> <NA>
7 3 2006-03-22 a 25 <NA> <NA> <NA>
8 3 2006-09-18 e 29 <NA> <NA> <NA>
9 4 2006-03-04 b 26 a <NA> <NA>
10 4 2006-03-04 a 25 <NA> <NA> <NA>
11 4 2006-08-22 c 27 a <NA> <NA>
12 4 2006-08-22 a 25 <NA> <NA> <NA>
13 5 2006-04-11 a 25 b <NA> <NA>
14 5 2006-04-11 b 26 <NA> <NA> <NA>
15 5 2006-10-19 a 25 c b a
16 5 2006-10-19 c 27 <NA> <NA> <NA>
17 5 2006-10-19 b 25 <NA> <NA> <NA>
18 5 2006-10-19 a 25 <NA> <NA> <NA>
Upvotes: 2
Reputation: 887148
You can use base R
indx <- with(mydata, ave(as.numeric(dstart1), id,
FUN=function(x) c(x[-1]==x[-length(x)], FALSE)))
transform(mydata, offence2=ifelse(!!indx,
c(as.character(offence[-1]), NA), NA))
Or using dplyr
library(dplyr)
mydata %>%
group_by(id) %>%
mutate(offence2= dstart1==lead(dstart1),
offence2= ifelse(!is.na(offence2)&offence2,
as.character(lead(offence)), NA_character_))
# id dstart1 offence cod_offence offence2
#1 1 2006-11-25 a 25 NA
#2 1 2006-12-13 b 26 c
#3 1 2006-12-13 c 27 NA
#4 2 2006-02-07 b 26 d
#5 2 2006-02-07 d 28 NA
#6 3 2006-01-15 a 25 NA
#7 3 2006-03-22 a 25 NA
#8 3 2006-09-18 e 29 NA
#9 4 2006-03-04 b 26 a
#10 4 2006-03-04 a 25 NA
#11 4 2006-08-22 c 27 a
#12 4 2006-08-22 a 25 NA
#13 5 2006-04-11 a 25 b
#14 5 2006-04-11 b 26 NA
#15 5 2006-10-19 a 25 NA
or using data.table
library(data.table)
setDT(mydata)[, indx:=c(dstart1[-1]==dstart1[-.N], FALSE), by=id][,
offence2:=ifelse(indx, as.character(offence)[which(indx)+1],
NA_character_), by=id][,indx:=NULL]
mydata
# id dstart1 offence cod_offence offence2
#1: 1 2006-11-25 a 25 NA
#2: 1 2006-12-13 b 26 c
#3: 1 2006-12-13 c 27 NA
#4: 2 2006-02-07 b 26 d
#5: 2 2006-02-07 d 28 NA
#6: 3 2006-01-15 a 25 NA
#7: 3 2006-03-22 a 25 NA
#8: 3 2006-09-18 e 29 NA
#9: 4 2006-03-04 b 26 a
#10: 4 2006-03-04 a 25 NA
#11: 4 2006-08-22 c 27 a
#12: 4 2006-08-22 a 25 NA
#13: 5 2006-04-11 a 25 b
#14: 5 2006-04-11 b 26 NA
#15: 5 2006-10-19 a 25 NA
Using the new dataset mydata2
and if you use the first method, we get d1
indx <- with(mydata2, ave(as.numeric(dstart1), id,
FUN=function(x) c(x[-1]==x[-length(x)], FALSE)))
d1 <- transform(mydata2, offence2=ifelse(!!indx,
c(as.character(offence[-1]), NA), NA))
From d1
, we can create an indx
column and then use dcast
to convert from long
form to wide
for the column offence2
. If there are columns with all NAs
, we can remove that by using colSums(is.na(
. Rename the columns, and then use mutate_each
from dplyr
to sort the columns, and finally cbind
it with mydata2
d1$indx <- with(d1, ave(seq_along(id), id, dstart1, FUN=seq_along))
library(reshape2)
d2 <- dcast(d1, id + dstart1+indx~indx, value.var='offence2')
d2New <- d2[,colSums(is.na(d2))!=nrow(d2)]
nm1 <- grep("^\\d",colnames(d2New))
colnames(d2New)[nm1] <- paste0('offence', 2:(length(nm1)+1))
d3 <- d2New[,-3] %>%
group_by(id, dstart1) %>%
mutate_each(funs(.[order(.)])) %>%
ungroup()
cbind(mydata,d3[,-c(1:2)])
# id dstart1 offence cod_offence offence2 offence3 offence4
#1 1 2006-11-25 a 25 <NA> <NA> <NA>
#2 1 2006-12-13 b 26 c <NA> <NA>
#3 1 2006-12-13 c 27 <NA> <NA> <NA>
#4 2 2006-02-07 b 26 d <NA> <NA>
#5 2 2006-02-07 d 28 <NA> <NA> <NA>
#6 2 2006-04-12 b 26 d c a
#7 2 2006-04-12 d 28 <NA> <NA> <NA>
#8 2 2006-04-12 c 27 <NA> <NA> <NA>
#9 2 2006-04-12 a 25 <NA> <NA> <NA>
mydata <- structure(list(id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5,
5, 5), dstart1 = structure(c(13477, 13495, 13495, 13186, 13186,
13163, 13229, 13409, 13211, 13211, 13382, 13382, 13249, 13249,
13440), class = "Date"), offence = structure(c(1L, 2L, 3L, 2L,
4L, 1L, 1L, 5L, 2L, 1L, 3L, 1L, 1L, 2L, 1L), .Label = c("a",
"b", "c", "d", "e"), class = "factor"), cod_offence = c(25, 26,
27, 26, 28, 25, 25, 29, 26, 25, 27, 25, 25, 26, 25)), .Names = c("id",
"dstart1", "offence", "cod_offence"), row.names = c(NA, -15L),
class = "data.frame")
mydata2 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
dstart1 = structure(c(13477, 13495, 13495, 13186, 13186, 13250, 13250,
13250, 13250), class = "Date"), offence = c("a", "b", "c", "b", "d", "b",
"d", "c", "a"), cod_offence = c(25L, 26L, 27L, 26L, 28L, 26L, 28L, 27L, 25L
)), .Names = c("id", "dstart1", "offence", "cod_offence"), row.names =
c("1","2", "3", "4", "5", "6", "7", "8", "9"), class = "data.frame")
Upvotes: 1