Reputation: 1315
I am not 100% sure how to formulate my question because I don't know the formal names are for what it is that I am trying to do with my dataset. Based on previous questions, there appears to be some way to address what I am trying to, but I am unable at making the logical jump from their problem to my own.
I have attached a sample of my data here.
The first thing I did with my data was add a column indicating which species (sps) are predators (coded as 1) and which species are prey (coded as 0).
#specify which are predators and prey
d1 = d1 %>%
group_by(sps) %>% #grouped by species
mutate(pp=ifelse(sps %in% c("MUXX", "MUVI","MEME"), 1,0)) #mutate to specify predators as 1 and prey as 0
My data is structured as such:
head(d1) #visualize the first few lines of the data
# A tibble: 6 x 8
# Groups: sps [4]
ID date km culv.id type sps time pp
<int> <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <dbl>
1 2012-06-19 80 A DCC MICRO 2:19 0
2 2012-06-21 80 A DCC MUXX 23:23 1
3 2012-07-15 80 A DCC MAMO 11:38 0
4 2012-07-20 80 A DCC MICRO 22:19 0
5 2012-07-29 80 A DCC MICRO 23:03 0
6 2012-08-07 80 A DCC PRLO 2:04 0
Here is also the output for dput(head(d1))
:
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20", "2012-07-29", "2012-08-07" ), km = c(80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A", "A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX", "MAMO", "MICRO", "MICRO", "PRLO" ), time = c("2:19", "23:23", "11:38", "22:19", "23:03", "2:04" ), pp = c(0, 1, 0, 0, 0, 0)), .Names = c("ID", "date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA, 6L ), class = "data.frame")
I also converted the time and date using the following code:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE) #converting the date/time into a new format
The (most) relevant columns are date, time, and pp (where 1 = predator species and 0 = prey species).
I am now trying to figure out how to extract the following information (average +/- std):
To put one of these examples (#2) into words:
What is the average time between when a prey species (pp = 0) is first seen followed by a predator species (pp = 1)?
I am trying to figure out how to do this for my dataset overall first. I think that once I figure out how to do that, it should be fairly straightforward to restrict the data.
Upvotes: 0
Views: 99
Reputation: 10301
Here is a data.table
(and lubridate
) version that might address your problem:
Using a selection of your posted data (posted at bottom), with a slight modification to your datetime
creation so that the format works with data.table
:
d1$datetime <- as.POSIXct(strptime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M'))
Convert to a data table:
d1 <- as.data.table(d1)
Calculate time differences for equal pp
values for animals by specialization (prey or predator), less than (pred to prey), or greater than (prey to pred).
d1$class.class <- d1[d1, difftime(x.datetime, i.datetime, units = "days"),
on = .(datetime > datetime, pp == pp), mult = "first"]
d1$prey.pred <-d1[d1, x.datetime - i.datetime,
on = .(datetime > datetime, pp > pp ), mult = "first"]
d1$pred.prey <- d1[d1, x.datetime - i.datetime,
on = .(datetime > datetime, pp < pp), mult = "first"]
Gives you a column for each:
> head(d1[, 7:ncol(d1)])
time pp datetime class.class prey.pred pred.prey
1: 2:19 0 2012-06-19 02:19:00 26.388194 days 2.877778 days NA days
2: 23:23 1 2012-06-21 23:23:00 74.177083 days NA days 23.51042 days
3: 11:38 0 2012-07-15 11:38:00 5.445139 days 50.666667 days NA days
4: 22:19 0 2012-07-20 22:19:00 9.030556 days 45.221528 days NA days
5: 23:03 0 2012-07-29 23:03:00 8.125694 days 36.190972 days NA days
6: 2:04 0 2012-08-07 02:04:00 1.911111 days 28.065278 days NA days
And you can get summary statistics as you like:
d1[by = sps,, .(mean.same.class = mean(class.class, na.rm = TRUE),
sd.same.class = sd(class.class, na.rm = TRUE),
mean.prey.pred = mean(prey.pred, na.rm = TRUE),
sd.prey.pred = sd(prey.pred, na.rm = TRUE),
mean.pred.prey = mean(pred.prey, na.rm = TRUE),
sd.pred.prey = sd(pred.prey, na.rm = TRUE))]
sps mean.same.class sd.same.class mean.prey.pred sd.prey.pred mean.pred.prey sd.pred.prey
1: MICRO 7.886237 days 8.0547631 18.80733 days 15.504646 NaN days NA
2: MUXX 42.073611 days 45.4011658 NaN days NA 13.01366 days 9.315697
3: MAMO 5.445139 days NA 50.66667 days NA NaN days NA
4: PRLO 2.475694 days 0.7984414 26.62708 days 2.033914 NaN days NA
5: LEAM 2.897222 days NA 10.11597 days NA NaN days NA
Libraries: data.table
, lubridate
Data:
> dput(d1)
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15200L, 15201L, 15199L, 15177L, 15178L, 15204L, 15205L
), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20",
"2012-07-29", "2012-08-07", "2012-08-08", "2012-08-09", "2012-08-13",
"2012-08-13", "2012-08-25", "2012-08-27", "2012-09-04", "2012-09-09",
"2012-09-11", "2012-09-14", "2012-09-23", "2012-09-26", "2012-09-27"
), km = c(80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L,
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC",
"DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC",
"DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX",
"MAMO", "MICRO", "MICRO", "PRLO", "MICRO", "PRLO", "MICRO", "MICRO",
"LEAM", "MICRO", "MUXX", "MICRO", "MICRO", "MUXX", "MICRO", "MICRO",
"MICRO"), time = c("2:19", "23:23", "11:38", "22:19", "23:03",
"2:04", "23:56", "23:06", "0:04", "0:46", "0:51", "22:23", "3:38",
"21:08", "0:40", "2:55", "22:09", "20:46", "3:20"), pp = c(0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0)), class = "data.frame", .Names = c("ID",
"date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA,
-19L))
Edit:
I'm not really sure about mixing tidyverse
and data.table
ideologies, but you could potentially do what you described in comments using do
. For example, make a modified version of df
:
d1 <- as.data.table(d1)
d1$datetime <- as.POSIXct(strptime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M'))
d1Mod <- d1
d1Mod$km[10:nrow(d1Mod)] <- 90
Then, define the data.table
bit as a function:
foo <- function(df_) {
df_$class.class <- df_[df_, difftime(x.datetime, i.datetime, units = "days"),
on = .(datetime > datetime, pp == pp), mult = "first"]
df_$prey.pred <-df_[df_, x.datetime - i.datetime,
on = .(datetime > datetime, pp > pp ), mult = "first"]
df_$pred.prey <- df_[df_, x.datetime - i.datetime,
on = .(datetime > datetime, pp < pp), mult = "first"]
return(df_)
}
Running d1 %>% group_by(km) %>% do(foo(as.data.table(.)))
gets you the same output as in the original answer above (since all km
values are 80). If you run it on the modified d1Mod
you get an output that looks like it has been grouped by km
:
> d1Mod %>%
+ group_by(km) %>%
+ do(foo(as.data.table(.)))
# A tibble: 19 x 12
# Groups: km [2]
ID date km culv.id type sps time pp datetime class.class prey.pred pred.prey
<int> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dttm> <time> <time> <time>
1 1 2012-06-19 80 A DCC MICRO 2:19 0 2012-06-19 02:19:00 26.3881944 days 2.877778 days NA days
2 2 2012-06-21 80 A DCC MUXX 23:23 1 2012-06-21 23:23:00 NA days NA days 23.510417 days
3 3 2012-07-15 80 A DCC MAMO 11:38 0 2012-07-15 11:38:00 5.4451389 days NA days NA days
4 4 2012-07-20 80 A DCC MICRO 22:19 0 2012-07-20 22:19:00 9.0305556 days NA days NA days
5 5 2012-07-29 80 A DCC MICRO 23:03 0 2012-07-29 23:03:00 8.1256944 days NA days NA days
6 8 2012-08-07 80 A DCC PRLO 2:04 0 2012-08-07 02:04:00 1.9111111 days NA days NA days
7 9 2012-08-08 80 A DCC MICRO 23:56 0 2012-08-08 23:56:00 0.9652778 days NA days NA days
8 10 2012-08-09 80 A DCC PRLO 23:06 0 2012-08-09 23:06:00 3.0402778 days NA days NA days
9 11 2012-08-13 80 A DCC MICRO 0:04 0 2012-08-13 00:04:00 NA days NA days NA days
10 12 2012-08-13 90 A DCC MICRO 0:46 0 2012-08-13 00:46:00 12.0034722 days 22.119444 days NA days
11 13 2012-08-25 90 A DCC LEAM 0:51 0 2012-08-25 00:51:00 2.8972222 days 10.115972 days NA days
12 14 2012-08-27 90 A DCC MICRO 22:23 0 2012-08-27 22:23:00 12.9479167 days 7.218750 days NA days
13 15200 2012-09-04 90 A DCC MUXX 3:38 1 2012-09-04 03:38:00 9.9701389 days NA days 5.729167 days
14 15201 2012-09-09 90 A DCC MICRO 21:08 0 2012-09-09 21:08:00 1.1472222 days 4.240972 days NA days
15 15199 2012-09-11 90 A DCC MICRO 0:40 0 2012-09-11 00:40:00 12.8951389 days 3.093750 days NA days
16 15177 2012-09-14 90 A DCC MUXX 2:55 1 2012-09-14 02:55:00 NA days NA days 9.801389 days
17 15178 2012-09-23 90 A DCC MICRO 22:09 0 2012-09-23 22:09:00 2.9423611 days NA days NA days
18 15204 2012-09-26 90 A DCC MICRO 20:46 0 2012-09-26 20:46:00 0.2736111 days NA days NA days
19 15205 2012-09-27 90 A DCC MICRO 3:20 0 2012-09-27 03:20:00 NA days NA days NA days
However, you'll have to do some checking to make sure that the calculations are actually doing what you need- I don't have example output or actual km
/year
info to truth these results against (read: I don't know what I'm looking at!).
Note also that I think arrange
is irrelevant for the operations here, considering that the datetime
gets sorted in the function.
Upvotes: 2
Reputation: 2311
I'll use the piece on the comments as an example:
d1 = structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20", "2012-07-29", "2012-08-07" ), km = c(80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A", "A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX", "MAMO", "MICRO", "MICRO", "PRLO" ), time = c("2:19", "23:23", "11:38", "22:19", "23:03", "2:04" ), pp = c(0, 1, 0, 0, 0, 0)), .Names = c("ID", "date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA, 6L ), class = "data.frame")
We add the datetime column just as you specified:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE)
First, add a column indicating which sequence of happened prey/predator and the time between observations (we remove the first row because there is no information about the previous observation). Note that, the timedif
is a numerical value indicating the number of days.
d1 = d1 %>% mutate(prev = lag(pp))
d1 = d1 %>% mutate(timedif = as.numeric(as.POSIXct(datetime) - lag(as.POSIXct(datetime))))
d1 = d1[2:nrow(d1),] %>% mutate(seque = as.factor(paste0(pp,prev)))
At this point, your table looks like
> d1
ID date km culv.id type sps time pp datetime prev timedif seque
1 2 2012-06-21 80 A DCC MUXX 23:23 1 2012-06-21 23:23 0 2.877778 10
2 3 2012-07-15 80 A DCC MAMO 11:38 0 2012-07-15 11:38 1 23.510417 01
3 4 2012-07-20 80 A DCC MICRO 22:19 0 2012-07-20 22:19 0 5.445139 00
4 5 2012-07-29 80 A DCC MICRO 23:03 0 2012-07-29 23:03 0 9.030556 00
5 8 2012-08-07 80 A DCC PRLO 2:04 0 2012-08-07 02:04 0 8.125694 00
After that, just take the wanted statistics for each group by using
avg = d1 %>% group_by(seque) %>% summarise(mean(timedif))
sdevs = d1 %>% group_by(seque) %>% summarise(sd(timedif))
We obtain
>avg
# A tibble: 3 x 2
seque `mean(timedif)`
<fctr> <dbl>
1 00 7.533796
2 01 23.510417
3 10 2.877778
> sdevs
# A tibble: 3 x 2
seque `sd(timedif)`
<fctr> <dbl>
1 00 1.864554
2 01 NA
3 10 NA
Note that the standard deviation is not computed because we only have one observation in the sample dataset for these categories.
Upvotes: 1