giac

Reputation: 4309

Identify and increment a sequence based on ambiguous identification

In order to explain my problem, let me introduce my dataset.

This is a sample of my database.

idno is the unique identifier, day is the day of observation, clockst is the clock time at which each episode starts, start is the start of each observation episode (in minutes), end is its end, and time is the duration of each episode.

In this database each idno is observed for two days, so 2880 minutes in total.

    idno    day clockst start  end time
1     1 Monday    1.30  1290 1310   20
2     1 Monday    1.50  1310 1320   10
3     1 Monday    2.00  1320 1440  120
4     1 Monday    4.00     0  385  385
5     1 Monday   10.25   385  405   20
6     1 Monday   10.45   405  450   45
7     1 Monday   11.30   450  485   35
8     1 Monday   12.05   485  495   10
9     1 Monday   12.15   495  515   20
10    1 Monday   12.35   515  600   85
11    1 Monday   14.00   600  615   15
12    1 Monday   14.15   615  640   25
13    1 Monday   14.40   640  705   65
14    1 Monday   15.45   705  710    5
15    1 Monday   15.50   710  725   15

My main issue is that day/clockst conflict with start. For some reason, the individuals were observed before the experiment started, so there are episodes recorded before start == 0 (start == 0 indicates the start of the experiment).

For instance, rows 1-3 were observed before start == 0.

What I would like is to remove these rows. However, I don't know how to uniquely identify each idno/day because of the following problem.

Take a look at another part of the sample:

   idno     day clockst start  end time
30    1  Monday   21.10  1030 1055   25
31    1  Monday   21.35  1055 1110   55
32    1  Monday   22.30  1110 1155   45
33    1  Monday   23.15  1155 1170   15
34    1  Monday   23.30  1170 1290  120
35    1 Tuesday    0.15  1215 1310   95
36    1 Tuesday    1.50  1310 1320   10
37    1 Tuesday    2.00  1320 1440  120
38    1 Tuesday    4.00     0  385  385
39    1 Tuesday   10.25   385  405   20
40    1 Tuesday   10.45   405  450   45

You can see here that the calendar day and the end of the experiment do not line up: if you look at rows 33-34 and the rows that follow, you can see that the experiment continues from Monday into Tuesday.

This is why I am having trouble uniquely identifying the observation time by idno and day.

Ultimately, what I would like is this:

    idno     day clockst start  end time ep day_obs select
1      1  Monday    1.30  1290 1310   20 NA       1 remove
2      1  Monday    1.50  1310 1320   10 NA       1 remove
3      1  Monday    2.00  1320 1440  120 NA       1 remove
4      1  Monday    4.00     0  385  385  1       1   keep
5      1  Monday   10.25   385  405   20  2       1   keep
6      1  Monday   10.45   405  450   45  3       1   keep
7      1  Monday   11.30   450  485   35  4       1   keep
8      1  Monday   12.05   485  495   10  5       1   keep
9      1  Monday   12.15   495  515   20  6       1   keep
10     1  Monday   12.35   515  600   85  7       1   keep
11     1  Monday   14.00   600  615   15  8       1   keep

I would like to create a variable called ep that increments with each episode of the observation time and is missing (NA) for the "pre-observation" period, so that I can remove the NA rows.

I would also like to create a variable counting the observation days (day_obs) that is based not on the calendar day but on the start of the experiment.

Any idea how I could do that?

I have put a subsample of the data and the desired output in CSV files:

library(RCurl)
# download the subsample and the desired output from GitHub and parse them as CSV
df <- getURL("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_subsample.csv")
df <- read.csv(text = df)

output <- getURL("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_sample_output.csv")
output <- read.csv(text = output)

Upvotes: 0

Views: 97

Answers (1)

Uwe

Reputation: 42544

The OP has requested to remove all rows before the first occurrence of start == 0 for each idno. This is equivalent to keeping all rows after (and including) the first appearance of start == 0 for each idno.

Variant 1: Subsetting only

This can be achieved using data.table:

library(data.table)
df <- fread("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_subsample.csv")

df2 <- df[df[, .I[seq(first(which(start == 0)), .N)], by = idno]$V1]
df2
     idno     day clockst start  end time
  1:    1  Monday    4.00     0  385  385
  2:    1  Monday   10.25   385  405   20
  3:    1  Monday   10.45   405  450   45
  4:    1  Monday   11.30   450  485   35
  5:    1  Monday   12.05   485  495   10
 ---                                     
 98:    2 Tuesday   22.00  1080 1090   10
 99:    2 Tuesday   22.10  1090 1135   45
100:    2 Tuesday   22.55  1135 1145   10
101:    2 Tuesday   23.05  1145 1170   25
102:    2 Tuesday   23.30  1170 1260   90

Explanation

df[, .I[seq(first(which(start == 0)), .N)], by = idno]

returns the row indices of the rows to keep for each idno. These are then used to subset df:

     idno  V1
  1:    1   4
  2:    1   5
  3:    1   6
  4:    1   7
  5:    1   8
 ---         
 98:    2 104
 99:    2 105
100:    2 106
101:    2 107
102:    2 108
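
For readers who prefer dplyr, roughly the same subsetting can be sketched as follows. This is only an illustrative equivalent of mine, not part of the original answer; it assumes dplyr is installed and it silently drops any idno that has no start == 0 row at all.

# dplyr sketch of Variant 1: keep rows from the first start == 0 onwards, per idno
library(dplyr)
df2_dplyr <- df %>%
  group_by(idno) %>%
  filter(row_number() >= which(start == 0)[1]) %>%
  ungroup()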

Variant 2: Alternative approach and answers to additional questions

In addition to filtering the data, the OP has requested to

  • create a count of the observation days (day_obs) for each idno, which advances each time start == 0 is encountered, and
  • create a row count ep for each idno and day_obs (which the OP, somewhat misleadingly, calls incrementing the observation time).

day_obs can be created in the unfiltered original df by

df[, day_obs := cumsum(start == 0), by = idno]
df
     idno     day clockst start  end time day_obs
  1:    1  Monday    1.30  1290 1310   20       0
  2:    1  Monday    1.50  1310 1320   10       0
  3:    1  Monday    2.00  1320 1440  120       0
  4:    1  Monday    4.00     0  385  385       1
  5:    1  Monday   10.25   385  405   20       1
 ---                                             
104:    2 Tuesday   22.00  1080 1090   10       2
105:    2 Tuesday   22.10  1090 1135   45       2
106:    2 Tuesday   22.55  1135 1145   10       2
107:    2 Tuesday   23.05  1145 1170   25       2
108:    2 Tuesday   23.30  1170 1260   90       2

Note that the rows before the first occurrence of start == 0 can now be filtered out via day_obs == 0.
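
A roughly equivalent running counter can also be written with dplyr; this is a sketch of mine under the assumption that dplyr is available, not something taken from the answer (the result is stored in a separate object, df_days, to leave the data.table df untouched):

# dplyr sketch: day_obs advances by one at every row where start == 0
library(dplyr)
df_days <- df %>%
  group_by(idno) %>%
  mutate(day_obs = cumsum(start == 0)) %>%
  ungroup()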

Now, we can easily create ep as requested by

df[day_obs > 0, ep := rowid(idno, day_obs)]
df
     idno     day clockst start  end time day_obs ep
  1:    1  Monday    1.30  1290 1310   20       0 NA
  2:    1  Monday    1.50  1310 1320   10       0 NA
  3:    1  Monday    2.00  1320 1440  120       0 NA
  4:    1  Monday    4.00     0  385  385       1  1
  5:    1  Monday   10.25   385  405   20       1  2
 ---                                                
104:    2 Tuesday   22.00  1080 1090   10       2 18
105:    2 Tuesday   22.10  1090 1135   45       2 19
106:    2 Tuesday   22.55  1135 1145   10       2 20
107:    2 Tuesday   23.05  1145 1170   25       2 21
108:    2 Tuesday   23.30  1170 1260   90       2 22
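
Finally, if a keep/remove flag like the select column in the OP's desired output is wanted, it can be derived from ep in one more step. This last bit is my own addition rather than part of the original answer:

# flag the pre-observation rows (ep is NA there) and drop them
df[, select := ifelse(is.na(ep), "remove", "keep")]
df_clean <- df[select == "keep"]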

Data

As download links may break in the future, here is the dput() of the downloaded data:

df <- setDT(
structure(list(idno = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L), day = c("Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday"), clockst = c(1.3, 1.5, 2, 4, 10.25, 10.45, 11.3, 12.05, 
12.15, 12.35, 14, 14.15, 14.4, 15.45, 15.5, 16.05, 16.1, 16.3, 
16.5, 17.4, 17.55, 18.25, 19, 19.2, 19.35, 20, 20.25, 20.3, 20.55, 
21.1, 21.35, 22.3, 23.15, 23.3, 0.15, 1.5, 2, 4, 10.25, 10.45, 
11.3, 12, 12.1, 12.3, 12.5, 13.05, 13.55, 14.3, 14.5, 15.5, 16, 
16.1, 17.4, 17.55, 18.25, 18.3, 20, 20.15, 20.2, 20.55, 21.1, 
21.3, 21.45, 22, 23.15, 0.1, 1, 1.1, 4, 12.05, 12.25, 13, 13.2, 
14.3, 15.35, 16, 16.3, 17, 18.3, 19.3, 20, 22, 23.1, 23.3, 1, 
1.15, 4, 11.05, 11.3, 11.4, 12, 12.15, 13, 14, 14.3, 15.2, 17, 
17.3, 18.3, 19.3, 21.2, 21.35, 21.5, 22, 22.1, 22.55, 23.05, 
23.3), start = c(1290L, 1310L, 1320L, 0L, 385L, 405L, 450L, 485L, 
495L, 515L, 600L, 615L, 640L, 705L, 710L, 725L, 730L, 750L, 770L, 
820L, 835L, 865L, 900L, 920L, 935L, 960L, 985L, 990L, 1015L, 
1030L, 1055L, 1110L, 1155L, 1170L, 1215L, 1310L, 1320L, 0L, 385L, 
405L, 450L, 480L, 490L, 510L, 530L, 545L, 595L, 630L, 650L, 710L, 
720L, 730L, 820L, 835L, 865L, 870L, 960L, 975L, 980L, 1015L, 
1030L, 1050L, 1065L, 1080L, 1155L, 1210L, 1260L, 1270L, 0L, 485L, 
505L, 540L, 560L, 630L, 695L, 720L, 750L, 780L, 870L, 930L, 960L, 
1080L, 1150L, 1170L, 1260L, 1275L, 0L, 425L, 450L, 460L, 480L, 
495L, 540L, 600L, 630L, 680L, 780L, 810L, 870L, 930L, 1040L, 
1055L, 1070L, 1080L, 1090L, 1135L, 1145L, 1170L), end = c(1310L, 
1320L, 1440L, 385L, 405L, 450L, 485L, 495L, 515L, 600L, 615L, 
640L, 705L, 710L, 725L, 730L, 750L, 770L, 820L, 835L, 865L, 900L, 
920L, 935L, 960L, 985L, 990L, 1015L, 1030L, 1055L, 1110L, 1155L, 
1170L, 1290L, 1310L, 1320L, 1440L, 385L, 405L, 450L, 480L, 490L, 
510L, 530L, 545L, 595L, 630L, 650L, 710L, 720L, 730L, 820L, 835L, 
865L, 870L, 960L, 975L, 980L, 1015L, 1030L, 1050L, 1065L, 1080L, 
1155L, 1215L, 1260L, 1270L, 1440L, 485L, 505L, 540L, 560L, 630L, 
695L, 720L, 750L, 780L, 870L, 930L, 960L, 1080L, 1150L, 1170L, 
1210L, 1275L, 1440L, 425L, 450L, 460L, 480L, 495L, 540L, 600L, 
630L, 680L, 780L, 810L, 870L, 930L, 1040L, 1055L, 1070L, 1080L, 
1090L, 1135L, 1145L, 1170L, 1260L), time = c(20L, 10L, 120L, 
385L, 20L, 45L, 35L, 10L, 20L, 85L, 15L, 25L, 65L, 5L, 15L, 5L, 
20L, 20L, 50L, 15L, 30L, 35L, 20L, 15L, 25L, 25L, 5L, 25L, 15L, 
25L, 55L, 45L, 15L, 120L, 95L, 10L, 120L, 385L, 20L, 45L, 30L, 
10L, 20L, 20L, 15L, 50L, 35L, 20L, 60L, 10L, 10L, 90L, 15L, 30L, 
5L, 90L, 15L, 5L, 35L, 15L, 20L, 15L, 15L, 75L, 60L, 50L, 10L, 
170L, 485L, 20L, 35L, 20L, 70L, 65L, 25L, 30L, 30L, 90L, 60L, 
30L, 120L, 70L, 20L, 40L, 15L, 165L, 425L, 25L, 10L, 20L, 15L, 
45L, 60L, 30L, 50L, 100L, 30L, 60L, 60L, 110L, 15L, 15L, 10L, 
10L, 45L, 10L, 25L, 90L)), .Names = c("idno", "day", "clockst", 
"start", "end", "time"), row.names = c(NA, -108L), class = "data.frame", index = structure(integer(0), "`__start`" = c(4L, 
38L, 69L, 87L, 5L, 39L, 6L, 40L, 88L, 7L, 41L, 89L, 90L, 42L, 
91L, 8L, 70L, 43L, 9L, 92L, 71L, 44L, 10L, 45L, 72L, 93L, 46L, 
73L, 47L, 11L, 94L, 12L, 48L, 74L, 95L, 13L, 49L, 96L, 75L, 14L, 
15L, 50L, 51L, 76L, 16L, 17L, 52L, 18L, 77L, 19L, 78L, 97L, 98L, 
20L, 53L, 21L, 54L, 22L, 55L, 56L, 79L, 99L, 23L, 24L, 80L, 100L, 
25L, 26L, 57L, 81L, 58L, 59L, 27L, 28L, 29L, 60L, 30L, 61L, 101L, 
62L, 31L, 102L, 63L, 103L, 64L, 82L, 104L, 105L, 32L, 106L, 107L, 
83L, 33L, 65L, 34L, 84L, 108L, 66L, 35L, 67L, 85L, 68L, 86L, 
1L, 2L, 36L, 3L, 37L), "`__idno`" = integer(0)))
)

Upvotes: 2
