Jan Blanke

Reputation: 181

Selecting specific rows in R

I am currently working with GPS data; the animal's position was recorded every 4 hours where possible. The data look like this (the XY coordinates are omitted here):

  ID  TIME           POSIXTIME  date_only
1   1 12:00 2005-05-08 12:00:00 2005-05-08
2   2 16:01 2005-05-08 16:01:00 2005-05-08
3   3 20:01 2005-05-08 20:01:00 2005-05-08
4   4  0:01 2005-05-09 00:01:00 2005-05-09
5   5  8:01 2005-05-09 08:01:00 2005-05-09
6   6 12:01 2005-05-09 12:01:00 2005-05-09
7   7 16:02 2005-05-09 16:02:00 2005-05-09
8   8 20:02 2005-05-09 20:02:00 2005-05-09
9   9  0:01 2005-05-10 00:01:00 2005-05-10
10 10  4:00 2005-05-10 04:00:00 2005-05-10

I would now like to keep only the first location per day. In most cases this will be at 0:01. However, sometimes it will be at 4:01 or even later because of missing data. How can I extract only the first location of each day? The results should go into a new data frame. I tried:

tapply(as.numeric(Kandularaw$TIME),list(Kandularaw$date_only),min, na.rm=T)

However, this did not work because R returns strange values when TIME is converted to numeric. Is it possible to do this with an ifelse statement? If so, roughly what would it look like? I am grateful for any help I can get. Thank you for your efforts.

Cheers,

Jan

Upvotes: 1

Views: 2174

Answers (2)

Gavin Simpson

Reputation: 174813

I would approach this from a simpler point of view. First, ensure that POSIXTIME is one of the "POSIX" classes. Then order the data by POSIXTIME. At this point we can use any of the split-apply-combine idioms to do what you want, making use of the head() function. Here I use aggregate().

Using this example data set:

dat <- structure(list(ID = 1:10, TIME = structure(c(4L, 6L, 8L, 1L, 
3L, 5L, 7L, 9L, 1L, 2L), .Label = c("00:01:00", "04:00:00", "08:01:00", 
"12:00:00", "12:01:00", "16:01:00", "16:02:00", "20:01:00", "20:02:00"
), class = "factor"), POSIXTIME = structure(1:10, .Label = c("2005/05/08 12:00:00", 
"2005/05/08 16:01:00", "2005/05/08 20:01:00", "2005/05/09 00:01:00", 
"2005/05/09 08:01:00", "2005/05/09 12:01:00", "2005/05/09 16:02:00", 
"2005/05/09 20:02:00", "2005/05/10 00:01:00", "2005/05/10 04:00:00"
), class = "factor"), date_only = structure(c(1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L), .Label = c("2005/05/08", "2005/05/09", 
"2005/05/10"), class = "factor")), .Names = c("ID", "TIME", "POSIXTIME", 
"date_only"), class = "data.frame", row.names = c(NA, 10L))

First, get POSIXTIME and date_only in the correct formats:

dat <- transform(dat,
                 POSIXTIME = as.POSIXct(POSIXTIME, format = "%Y/%m/%d %H:%M:%S"),
                 date_only = as.Date(date_only, format = "%Y/%m/%d"))
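As an aside, factor columns are probably also why as.numeric(Kandularaw$TIME) gave odd values in the question: as.numeric() on a factor returns the internal level codes rather than the clock times. A minimal sketch with made-up values; the names time_factor and parsed, the single date, and the UTC time zone are assumptions for illustration, not part of the original data:

```r
# Hypothetical TIME column stored as a factor (illustrative values only)
time_factor <- factor(c("12:00", "16:01", "00:01", "04:00"))

# as.numeric() on a factor gives the integer level codes (1 to 4 here),
# not anything resembling hours or minutes
codes <- as.numeric(time_factor)
print(codes)

# Converting via character and parsing as a date-time keeps the real values.
# The date and the UTC time zone are assumptions for this sketch.
parsed <- as.POSIXct(paste("2005-05-09", as.character(time_factor)),
                     format = "%Y-%m-%d %H:%M", tz = "UTC")
print(min(parsed))
```

Taking min() of the parsed date-times then behaves as expected, whereas min() of the level codes only reflects the sort order of the factor labels.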

Next, order by POSIXTIME:

dato <- with(dat, dat[order(POSIXTIME), ])

The final step is to use aggregate() to split the data by date_only and use head() to select the first row:

aggregate(dato[,1:3], by = list(date = dato$`date_only`), FUN = head, n = 1)

Notice that I pass the value 1 to the n argument of head(), indicating that it should extract only the first row of each day's observations. Because we sorted by date-time and split on date, the first row should be the first observation per day. Do be aware of rounding issues, however.

The final step results in:

> aggregate(dato[,1:3], by = list(date = dato$`date_only`), FUN = head, n = 1)
        date ID     TIME           POSIXTIME
1 2005-05-08  1 12:00:00 2005-05-08 12:00:00
2 2005-05-09  4 00:01:00 2005-05-09 00:01:00
3 2005-05-10  9 00:01:00 2005-05-10 00:01:00

Instead of dato[,1:3] refer to whatever columns in your original data set contain the variables (locations?) you wanted.
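If you would rather keep every column without listing them, a different base R idiom (not the aggregate() approach above) is to drop all but the first row for each date with duplicated(). A rough sketch on a small stand-in for the sorted data; the values, the UTC time zone, and the name first_per_day are assumptions for illustration:

```r
# Small stand-in for the data; values are illustrative only
dato <- data.frame(
  ID = 1:6,
  POSIXTIME = as.POSIXct(c("2005-05-08 12:00:00", "2005-05-08 16:01:00",
                           "2005-05-09 00:01:00", "2005-05-09 08:01:00",
                           "2005-05-10 00:01:00", "2005-05-10 04:00:00"),
                         tz = "UTC")
)
# Derive the date in the same (assumed) time zone as the timestamps
dato$date_only <- as.Date(dato$POSIXTIME, tz = "UTC")

# Sort by date-time so the earliest fix of each day comes first
dato <- dato[order(dato$POSIXTIME), ]

# duplicated() is FALSE for the first occurrence of each date,
# so negating it keeps exactly one row per day
first_per_day <- dato[!duplicated(dato$date_only), ]
print(first_per_day)
```

This relies on the same assumption as before: the rows must already be ordered by date-time, otherwise the "first" row per date is not the earliest one.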

Upvotes: 1

IRTFM

Reputation: 263342

I am guessing you really want a row number as an index into a position record. If you know that these rows are ordered by date-time, and you are getting satisfactory group splits with that second argument to tapply (however it was created), then try this:

idx <- tapply(1:NROW(Kandularaw), Kandularaw$date_only, "[", 1)

If you want records (rows) in that same dataframe then just use:

Kandularaw[ idx, ]
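To see what idx contains, here is a small self-contained sketch. The data frame below is a made-up stand-in for Kandularaw (only ID and date_only, with illustrative values), and first_rows is a hypothetical name:

```r
# Made-up stand-in for Kandularaw, already ordered by date-time
Kandularaw <- data.frame(
  ID = 1:5,
  date_only = as.Date(c("2005-05-08", "2005-05-08",
                        "2005-05-09",
                        "2005-05-10", "2005-05-10"))
)

# Split the row numbers by date and take the first row number of each group
idx <- tapply(1:NROW(Kandularaw), Kandularaw$date_only, "[", 1)
print(idx)

# Use those row numbers to pull the full records into a new data frame
first_rows <- Kandularaw[idx, ]
print(first_rows)
```

idx is named by date, which makes it easy to check which row was picked for each day.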

Upvotes: 1
