bmora
bmora

Reputation: 77

copy values of a column into another column based on a condition using a loop

I need to create a complicated "for" loop, but after reading some examples I'm still clueless of how to write it in a proper R way and therefore I'm not sure whether it will work or not. I'm still an R beginner :(

I have a dataset in the long format, with different occasions, however, some occasions are not truly new ones since the date of start is the same, but have a different offence that I need to copy in a new column called "offence2", after this I need to drop the false new occasion, in order to keep only rows that represent new occasions. My real data have up to 8 different offences for a single date, but I made a simpler example.

This are an example of how my data looks like

    id<-c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5)
    dstart<-c("25/11/2006", "13/12/2006","13/12/2006","07/02/2006","07/02/2006",
     "15/01/2006", "22/03/2006","18/09/2006", "04/03/2006","04/03/2006",
     "22/08/2006","22/08/2006","11/04/2006", "11/04/2006", "19/10/2006") 
    dstart1<-as.Date(dstart, "%d/%m/%Y")

    offence<-c("a","b","c","b","d","a","a","e","b","a","c","a","a","b","a")
    cod_offence<-c(25, 26,27,26,28,25,25,29,26,25,27,25,25,26,25)

    mydata<-data.frame(id, dstart1, offence, cod_offence)

Data

       id    dstart1   offence  cod_offence
   1   1   2006-11-25       a          25
   2   1   2006-12-13       b          26
   3   1   2006-12-13       c          27
   4   2   2006-02-07       b          26
   5   2   2006-02-07       d          28
   6   3   2006-01-15       a          25
   7   3   2006-03-22       a          25
   8   3   2006-09-18       e          29
   9   4   2006-03-04       b          26
   10  4   2006-03-04       a          25
   11  4   2006-08-22       c          27
   12  4   2006-08-22       a          25
   13  5   2006-04-11       a          25
   14  5   2006-04-11       b          26
   15  5   2006-10-19       a          25

I need something like this:

      id    dstart1   offence  cod_offence   offence2
   1   1   2006-11-25       a          25       NA
   2   1   2006-12-13       b          26       c
   3   1   2006-12-13       c          27       NA
   4   2   2006-02-07       b          26       d
   5   2   2006-02-07       d          28       NA
   6   3   2006-01-15       a          25       NA
   7   3   2006-03-22       a          25       NA
   8   3   2006-09-18       e          29       NA
   9   4   2006-03-04       b          26       a
   10  4   2006-03-04       a          25       NA
   11  4   2006-08-22       c          27       a
   12  4   2006-08-22       a          25       NA
   13  5   2006-04-11       a          25       b
   14  5   2006-04-11       b          26       NA
   15  5   2006-10-19       a          25       NA

I think that I need to do something like this: given i=individual j=observation within individual

for each individual I need to check whether mydata$dstart1(j) = mydata$dstart1(j+1)
if this is true, then copy mydata$offence2(j)=mydata$offence(j+1), otherwise keep the same value
This has to stop if id(j) != id(j+1) and re-start with the new id.

My problem is that I don't know how to put this in a loop.

Thank you!!

Update

Yes, it'w works fine with the example, but not yet with my real data, since they are a little bit more complex What happen if instead of two repeated dates I have three or more? each one of them with different offences. Following @CathG solution, I need to create more variables according to the number of offences (in my case 8), I guess I would need a new vector that identify the position of the observation within id and a new "instruction" that tell R that depending of the position of the mydata$dstart1, the value need to be copied in a different column. But then again, I don't know how to do it.

     id    dstart1   offence  cod_offence   offence2   offence3  offence4
   1   1   2006-11-25       a          25       NA        NA       NA
   2   1   2006-12-13       b          26       c         NA       NA
   3   1   2006-12-13       c          27       NA        NA       NA
   4   2   2006-02-07       b          26       d         NA       NA
   5   2   2006-02-07       d          28       NA        NA       NA
   6   2   2006-04-12       b          26       d         c        a
   7   2   2006-04-12       d          28       NA        NA       NA
   8   2   2006-04-12       c          27       NA        NA       NA
   9   2   2006-04-12       a          25       NA        NA       NA

Thanks again!!!

Upvotes: 1

Views: 7315

Answers (2)

Cath
Cath

Reputation: 24074

With splitand a loop :

# data with repeated dates /offences
id<-c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,5)
dstart<-c("25/11/2006", "13/12/2006","13/12/2006","07/02/2006","07/02/2006",
     "15/01/2006", "22/03/2006","18/09/2006", "04/03/2006","04/03/2006",
     "22/08/2006","22/08/2006","11/04/2006", "11/04/2006", "19/10/2006","19/10/2006","19/10/2006","19/10/2006") 
dstart1<-as.Date(dstart, "%d/%m/%Y")
offence<-c("a","b","c","b","d","a","a","e","b","a","c","a","a","b","a","c","b","a")
cod_offence<-c(25, 26,27,26,28,25,25,29,26,25,27,25,25,26,25,27,25,25)
mydata<-data.frame(id, dstart1, offence, cod_offence)

# see the max offences there are for same id and date
maxoff<-max(table(mydata$id,mydata$dstart1))
mydata[,paste("offence",2:maxoff,sep="")]<-NA

# split your data according to id
splitmydata<-split(mydata,mydata$id) 

# for each "per id dataset", apply a function that looks for repeated offences / dates and fill the "offences" variables in the row with first occurence of specific date.
splitmydata2<-lapply(splitmydata, 
                       function(tab){
                          for(datestart in unique(tab[,"dstart1"])){
                            ind_date<-sort(which(tab[,"dstart1"]==datestart))
                            if(length(ind_date[-1])){
                               tab[ind_date[1],grep("^offence",colnames(tab),value=T)[2:(length(ind_date))]]<-as.character(tab[ind_date[-1],"offence"])
                              }
                           }
                          return(tab)
                       }
                     )

mydata2<-unsplit(splitmydata2,mydata$id) # finally, unsplit your data

> mydata2
   id    dstart1 offence cod_offence offence2 offence3 offence4
1   1 2006-11-25       a          25     <NA>     <NA>     <NA>
2   1 2006-12-13       b          26        c     <NA>     <NA>
3   1 2006-12-13       c          27     <NA>     <NA>     <NA>
4   2 2006-02-07       b          26        d     <NA>     <NA>
5   2 2006-02-07       d          28     <NA>     <NA>     <NA>
6   3 2006-01-15       a          25     <NA>     <NA>     <NA>
7   3 2006-03-22       a          25     <NA>     <NA>     <NA>
8   3 2006-09-18       e          29     <NA>     <NA>     <NA>
9   4 2006-03-04       b          26        a     <NA>     <NA>
10  4 2006-03-04       a          25     <NA>     <NA>     <NA>
11  4 2006-08-22       c          27        a     <NA>     <NA>
12  4 2006-08-22       a          25     <NA>     <NA>     <NA>
13  5 2006-04-11       a          25        b     <NA>     <NA>
14  5 2006-04-11       b          26     <NA>     <NA>     <NA>
15  5 2006-10-19       a          25        c        b        a
16  5 2006-10-19       c          27     <NA>     <NA>     <NA>
17  5 2006-10-19       b          25     <NA>     <NA>     <NA>
18  5 2006-10-19       a          25     <NA>     <NA>     <NA>

Upvotes: 2

akrun
akrun

Reputation: 887148

You can use base R

indx <- with(mydata, ave(as.numeric(dstart1), id,
           FUN=function(x) c(x[-1]==x[-length(x)], FALSE)))

 transform(mydata, offence2=ifelse(!!indx, 
            c(as.character(offence[-1]), NA), NA))

Or using dplyr

library(dplyr)
mydata %>%
      group_by(id) %>% 
      mutate(offence2= dstart1==lead(dstart1), 
       offence2= ifelse(!is.na(offence2)&offence2,
         as.character(lead(offence)), NA_character_))
#     id    dstart1 offence cod_offence offence2
#1   1 2006-11-25       a          25       NA
#2   1 2006-12-13       b          26        c
#3   1 2006-12-13       c          27       NA
#4   2 2006-02-07       b          26        d
#5   2 2006-02-07       d          28       NA
#6   3 2006-01-15       a          25       NA
#7   3 2006-03-22       a          25       NA
#8   3 2006-09-18       e          29       NA
#9   4 2006-03-04       b          26        a
#10  4 2006-03-04       a          25       NA
#11  4 2006-08-22       c          27        a
#12  4 2006-08-22       a          25       NA
#13  5 2006-04-11       a          25        b
#14  5 2006-04-11       b          26       NA
#15  5 2006-10-19       a          25       NA

or using data.table

library(data.table)
setDT(mydata)[, indx:=c(dstart1[-1]==dstart1[-.N], FALSE), by=id][,
      offence2:=ifelse(indx, as.character(offence)[which(indx)+1],
                                 NA_character_), by=id][,indx:=NULL]

mydata
 #    id    dstart1 offence cod_offence offence2
 #1:  1 2006-11-25       a          25       NA
 #2:  1 2006-12-13       b          26        c
 #3:  1 2006-12-13       c          27       NA
 #4:  2 2006-02-07       b          26        d
 #5:  2 2006-02-07       d          28       NA
 #6:  3 2006-01-15       a          25       NA
 #7:  3 2006-03-22       a          25       NA
 #8:  3 2006-09-18       e          29       NA
 #9:  4 2006-03-04       b          26        a
#10:  4 2006-03-04       a          25       NA
#11:  4 2006-08-22       c          27        a
#12:  4 2006-08-22       a          25       NA
#13:  5 2006-04-11       a          25        b
#14:  5 2006-04-11       b          26       NA
#15:  5 2006-10-19       a          25       NA

Update

Using the new dataset mydata2 and if you use the first method, we get d1

 indx <- with(mydata2, ave(as.numeric(dstart1), id,
       FUN=function(x) c(x[-1]==x[-length(x)], FALSE)))

 d1 <-  transform(mydata2, offence2=ifelse(!!indx, 
                  c(as.character(offence[-1]), NA), NA))

From d1, we can create an indx column and then use dcast to convert from long form to wide for the column offence2. If there are columns with all NAs, we can remove that by using colSums(is.na(. Rename the columns, and then use mutate_each from dplyr to sort the columns, and finally cbind it with mydata2

 d1$indx <- with(d1, ave(seq_along(id), id, dstart1, FUN=seq_along))
 library(reshape2)

 d2 <- dcast(d1, id + dstart1+indx~indx, value.var='offence2')
 d2New <- d2[,colSums(is.na(d2))!=nrow(d2)]
 nm1 <-  grep("^\\d",colnames(d2New))
 colnames(d2New)[nm1] <- paste0('offence', 2:(length(nm1)+1)) 
 d3 <- d2New[,-3] %>%
                group_by(id, dstart1) %>%
                mutate_each(funs(.[order(.)])) %>%
                ungroup()

 cbind(mydata,d3[,-c(1:2)])
 #    id    dstart1 offence cod_offence offence2 offence3 offence4
 #1  1 2006-11-25       a          25     <NA>     <NA>     <NA>
 #2  1 2006-12-13       b          26        c     <NA>     <NA>
 #3  1 2006-12-13       c          27     <NA>     <NA>     <NA>
 #4  2 2006-02-07       b          26        d     <NA>     <NA>
 #5  2 2006-02-07       d          28     <NA>     <NA>     <NA>
 #6  2 2006-04-12       b          26        d        c        a
 #7  2 2006-04-12       d          28     <NA>     <NA>     <NA>
 #8  2 2006-04-12       c          27     <NA>     <NA>     <NA>
 #9  2 2006-04-12       a          25     <NA>     <NA>     <NA>

data

mydata <- structure(list(id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 
5, 5), dstart1 = structure(c(13477, 13495, 13495, 13186, 13186, 
13163, 13229, 13409, 13211, 13211, 13382, 13382, 13249, 13249, 
13440), class = "Date"), offence = structure(c(1L, 2L, 3L, 2L, 
4L, 1L, 1L, 5L, 2L, 1L, 3L, 1L, 1L, 2L, 1L), .Label = c("a", 
"b", "c", "d", "e"), class = "factor"), cod_offence = c(25, 26, 
27, 26, 28, 25, 25, 29, 26, 25, 27, 25, 25, 26, 25)), .Names = c("id", 
"dstart1", "offence", "cod_offence"), row.names = c(NA, -15L), 
class = "data.frame")

mydata2 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
dstart1 = structure(c(13477, 13495, 13495, 13186, 13186, 13250, 13250,
 13250, 13250), class = "Date"), offence = c("a", "b", "c", "b", "d", "b",
"d", "c", "a"), cod_offence = c(25L, 26L, 27L, 26L, 28L, 26L, 28L, 27L, 25L
)), .Names = c("id", "dstart1", "offence", "cod_offence"), row.names =
 c("1","2", "3", "4", "5", "6", "7", "8", "9"), class = "data.frame")

Upvotes: 1

Related Questions