Reputation: 4309
In order to explain my problem let me introduce my dataset.
This is a sample of my database.
idno
is the unique identifier, day
is the day of observation, clockst
is start of the day, start
is the observation start, end
the end of observation and time
the duration of each episode.
In this database each idno is observed for two days, so 2880 minutes in total.
idno day clockst start end time
1 1 Monday 1.30 1290 1310 20
2 1 Monday 1.50 1310 1320 10
3 1 Monday 2.00 1320 1440 120
4 1 Monday 4.00 0 385 385
5 1 Monday 10.25 385 405 20
6 1 Monday 10.45 405 450 45
7 1 Monday 11.30 450 485 35
8 1 Monday 12.05 485 495 10
9 1 Monday 12.15 495 515 20
10 1 Monday 12.35 515 600 85
11 1 Monday 14.00 600 615 15
12 1 Monday 14.15 615 640 25
13 1 Monday 14.40 640 705 65
14 1 Monday 15.45 705 710 5
15 1 Monday 15.50 710 725 15
My main issue is that day
/clockst
conflict with start
. For some reasons, the individuals have been observed before the experiment started, so basically all the episodes before start == 0
. (start == 0
indicates the start of the experiment).
For instance, rows 1—3 are observed before start == 0
.
What I would like is to remove these rows. However, I don't know how to uniquely identify each idno/day because of the following problem.
Take a look at another part of the sample
idno day clockst start end time
30 1 Monday 21.10 1030 1055 25
31 1 Monday 21.35 1055 1110 55
32 1 Monday 22.30 1110 1155 45
33 1 Monday 23.15 1155 1170 15
34 1 Monday 23.30 1170 1290 120
35 1 Tuesday 0.15 1215 1310 95
36 1 Tuesday 1.50 1310 1320 10
37 1 Tuesday 2.00 1320 1440 120
38 1 Tuesday 4.00 0 385 385
39 1 Tuesday 10.25 385 405 20
40 1 Tuesday 10.45 405 450 45
You can see here that the day and the end of the experiment are not the same. If you look at rows 33—34, you can see that the experiment continues from the Monday to Tuesday.
This is why I am having trouble identifying uniquely the observation time by idno and day.
Ultimately, what I would like is this
idno day clockst start end time ep day_obs select
1 1 Monday 1.30 1290 1310 20 NA 1 remove
2 1 Monday 1.50 1310 1320 10 NA 1 remove
3 1 Monday 2.00 1320 1440 120 NA 1 remove
4 1 Monday 4.00 0 385 385 1 1 keep
5 1 Monday 10.25 385 405 20 2 1 keep
6 1 Monday 10.45 405 450 45 3 1 keep
7 1 Monday 11.30 450 485 35 4 1 keep
8 1 Monday 12.05 485 495 10 5 1 keep
9 1 Monday 12.15 495 515 20 6 1 keep
10 1 Monday 12.35 515 600 85 7 1 keep
11 1 Monday 14.00 600 615 15 8 1 keep
I would like to create a variable called ep
incrementing the observation time, and having missing on the "pre-observation" period, in order for me to remove the NA rows.
I also would like to create a variable of the observation days (day_obs
) that is not based on the day
but on the start of the experiment.
Any idea how I could do that?
I put a subsample of the data and the output on a csv file
library(RCurl)
df <- getURL("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_subsample.csv")
df <- read.csv(text = df)
output <- getURL("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_sample_output.csv")
output <- read.csv(text = output)
Upvotes: 0
Views: 97
Reputation: 42544
The OP has requested to remove all rows before the first occurrence of start == 0
for each idno
. This is equivalent to keeping all rows after (and including) the first appearance of start == 0
for each idno
.
This can be achieved using data.table
:
library(data.table)
df <- fread("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_subsample.csv")
df2 <- df[df[, .I[seq(first(which(start == 0)), .N)], by = idno]$V1]
df2
idno day clockst start end time 1: 1 Monday 4.00 0 385 385 2: 1 Monday 10.25 385 405 20 3: 1 Monday 10.45 405 450 45 4: 1 Monday 11.30 450 485 35 5: 1 Monday 12.05 485 495 10 --- 98: 2 Tuesday 22.00 1080 1090 10 99: 2 Tuesday 22.10 1090 1135 45 100: 2 Tuesday 22.55 1135 1145 10 101: 2 Tuesday 23.05 1145 1170 25 102: 2 Tuesday 23.30 1170 1260 90
df[, .I[seq(first(which(start == 0)), .N)], by = idno]
returns the row indices of the rows to keep for each idno
. These are used for subsequent subsetting of df
idno V1 1: 1 4 2: 1 5 3: 1 6 4: 1 7 5: 1 8 --- 98: 2 104 99: 2 105 100: 2 106 101: 2 107 102: 2 108
In addition to filtering the data, the OP has requested to
day_obs
for each idno
which is advanced each time start == 0
is encountered andday_obs
and idno
(which the OP calls somewhat misleading incrementing the observation time).day_obs
can be created in the unfiltered original df
by
df[, day_obs := cumsum(start == 0 ), by = idno]
df
idno day clockst start end time day_obs 1: 1 Monday 1.30 1290 1310 20 0 2: 1 Monday 1.50 1310 1320 10 0 3: 1 Monday 2.00 1320 1440 120 0 4: 1 Monday 4.00 0 385 385 1 5: 1 Monday 10.25 385 405 20 1 --- 104: 2 Tuesday 22.00 1080 1090 10 2 105: 2 Tuesday 22.10 1090 1135 45 2 106: 2 Tuesday 22.55 1135 1145 10 2 107: 2 Tuesday 23.05 1145 1170 25 2 108: 2 Tuesday 23.30 1170 1260 90 2
Note that the rows before the first occurrence of start == 0
can be filtered by day_obs == 0
.
Now, we can easily create ep
as requested by
df[day_obs > 0, ep := rowid(idno, day_obs)]
df
idno day clockst start end time day_obs ep 1: 1 Monday 1.30 1290 1310 20 0 NA 2: 1 Monday 1.50 1310 1320 10 0 NA 3: 1 Monday 2.00 1320 1440 120 0 NA 4: 1 Monday 4.00 0 385 385 1 1 5: 1 Monday 10.25 385 405 20 1 2 --- 104: 2 Tuesday 22.00 1080 1090 10 2 18 105: 2 Tuesday 22.10 1090 1135 45 2 19 106: 2 Tuesday 22.55 1135 1145 10 2 20 107: 2 Tuesday 23.05 1145 1170 25 2 21 108: 2 Tuesday 23.30 1170 1260 90 2 22
As download links may break in the future, here is the dput()
of the downloaded data:
df <- setDT(
structure(list(idno = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), day = c("Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday"), clockst = c(1.3, 1.5, 2, 4, 10.25, 10.45, 11.3, 12.05,
12.15, 12.35, 14, 14.15, 14.4, 15.45, 15.5, 16.05, 16.1, 16.3,
16.5, 17.4, 17.55, 18.25, 19, 19.2, 19.35, 20, 20.25, 20.3, 20.55,
21.1, 21.35, 22.3, 23.15, 23.3, 0.15, 1.5, 2, 4, 10.25, 10.45,
11.3, 12, 12.1, 12.3, 12.5, 13.05, 13.55, 14.3, 14.5, 15.5, 16,
16.1, 17.4, 17.55, 18.25, 18.3, 20, 20.15, 20.2, 20.55, 21.1,
21.3, 21.45, 22, 23.15, 0.1, 1, 1.1, 4, 12.05, 12.25, 13, 13.2,
14.3, 15.35, 16, 16.3, 17, 18.3, 19.3, 20, 22, 23.1, 23.3, 1,
1.15, 4, 11.05, 11.3, 11.4, 12, 12.15, 13, 14, 14.3, 15.2, 17,
17.3, 18.3, 19.3, 21.2, 21.35, 21.5, 22, 22.1, 22.55, 23.05,
23.3), start = c(1290L, 1310L, 1320L, 0L, 385L, 405L, 450L, 485L,
495L, 515L, 600L, 615L, 640L, 705L, 710L, 725L, 730L, 750L, 770L,
820L, 835L, 865L, 900L, 920L, 935L, 960L, 985L, 990L, 1015L,
1030L, 1055L, 1110L, 1155L, 1170L, 1215L, 1310L, 1320L, 0L, 385L,
405L, 450L, 480L, 490L, 510L, 530L, 545L, 595L, 630L, 650L, 710L,
720L, 730L, 820L, 835L, 865L, 870L, 960L, 975L, 980L, 1015L,
1030L, 1050L, 1065L, 1080L, 1155L, 1210L, 1260L, 1270L, 0L, 485L,
505L, 540L, 560L, 630L, 695L, 720L, 750L, 780L, 870L, 930L, 960L,
1080L, 1150L, 1170L, 1260L, 1275L, 0L, 425L, 450L, 460L, 480L,
495L, 540L, 600L, 630L, 680L, 780L, 810L, 870L, 930L, 1040L,
1055L, 1070L, 1080L, 1090L, 1135L, 1145L, 1170L), end = c(1310L,
1320L, 1440L, 385L, 405L, 450L, 485L, 495L, 515L, 600L, 615L,
640L, 705L, 710L, 725L, 730L, 750L, 770L, 820L, 835L, 865L, 900L,
920L, 935L, 960L, 985L, 990L, 1015L, 1030L, 1055L, 1110L, 1155L,
1170L, 1290L, 1310L, 1320L, 1440L, 385L, 405L, 450L, 480L, 490L,
510L, 530L, 545L, 595L, 630L, 650L, 710L, 720L, 730L, 820L, 835L,
865L, 870L, 960L, 975L, 980L, 1015L, 1030L, 1050L, 1065L, 1080L,
1155L, 1215L, 1260L, 1270L, 1440L, 485L, 505L, 540L, 560L, 630L,
695L, 720L, 750L, 780L, 870L, 930L, 960L, 1080L, 1150L, 1170L,
1210L, 1275L, 1440L, 425L, 450L, 460L, 480L, 495L, 540L, 600L,
630L, 680L, 780L, 810L, 870L, 930L, 1040L, 1055L, 1070L, 1080L,
1090L, 1135L, 1145L, 1170L, 1260L), time = c(20L, 10L, 120L,
385L, 20L, 45L, 35L, 10L, 20L, 85L, 15L, 25L, 65L, 5L, 15L, 5L,
20L, 20L, 50L, 15L, 30L, 35L, 20L, 15L, 25L, 25L, 5L, 25L, 15L,
25L, 55L, 45L, 15L, 120L, 95L, 10L, 120L, 385L, 20L, 45L, 30L,
10L, 20L, 20L, 15L, 50L, 35L, 20L, 60L, 10L, 10L, 90L, 15L, 30L,
5L, 90L, 15L, 5L, 35L, 15L, 20L, 15L, 15L, 75L, 60L, 50L, 10L,
170L, 485L, 20L, 35L, 20L, 70L, 65L, 25L, 30L, 30L, 90L, 60L,
30L, 120L, 70L, 20L, 40L, 15L, 165L, 425L, 25L, 10L, 20L, 15L,
45L, 60L, 30L, 50L, 100L, 30L, 60L, 60L, 110L, 15L, 15L, 10L,
10L, 45L, 10L, 25L, 90L)), .Names = c("idno", "day", "clockst",
"start", "end", "time"), row.names = c(NA, -108L), class = "data.frame", index = structure(integer(0), "`__start`" = c(4L,
38L, 69L, 87L, 5L, 39L, 6L, 40L, 88L, 7L, 41L, 89L, 90L, 42L,
91L, 8L, 70L, 43L, 9L, 92L, 71L, 44L, 10L, 45L, 72L, 93L, 46L,
73L, 47L, 11L, 94L, 12L, 48L, 74L, 95L, 13L, 49L, 96L, 75L, 14L,
15L, 50L, 51L, 76L, 16L, 17L, 52L, 18L, 77L, 19L, 78L, 97L, 98L,
20L, 53L, 21L, 54L, 22L, 55L, 56L, 79L, 99L, 23L, 24L, 80L, 100L,
25L, 26L, 57L, 81L, 58L, 59L, 27L, 28L, 29L, 60L, 30L, 61L, 101L,
62L, 31L, 102L, 63L, 103L, 64L, 82L, 104L, 105L, 32L, 106L, 107L,
83L, 33L, 65L, 34L, 84L, 108L, 66L, 35L, 67L, 85L, 68L, 86L,
1L, 2L, 36L, 3L, 37L), "`__idno`" = integer(0)))
)
Upvotes: 2