Reputation: 7526
My objective is to use data.table
to count the number of bikes that have left a station, and then to aggregate by station_id
, hour
, and date
.
If the previous record - current record of bikes_available
is positive, then that is the amount of bikes lost. If the previous record - current record is negative or zero, then that means the number of bikes has stayed the same or increased, so these cases should not be counted.
> head(dat, n = 10)
station_id bikes_available time date hour
1: 3 2 2018-01-15 01:58:02 2018-01-15 1
2: 3 1 2018-01-15 01:59:01 2018-01-15 1
3: 3 1 2018-01-15 02:00:03 2018-01-15 2
4: 3 4 2018-01-15 02:01:02 2018-01-15 2
5: 3 4 2018-01-15 02:02:02 2018-01-15 2
6: 3 1 2018-01-15 02:03:02 2018-01-15 2
7: 3 1 2018-01-15 02:04:02 2018-01-15 2
8: 3 1 2018-01-15 02:05:02 2018-01-15 2
9: 3 7 2018-01-15 02:06:02 2018-01-15 2
10: 3 3 2018-01-15 02:07:02 2018-01-15 2
The lead
function can be used to find the difference between the previous and current record and then only positive values filtered out with:
dat[,ba_lead:=shift(bikes_available, 1, type='lead')]
dat$diff <- dat$bikes_available - dat$ba_lead
But how would one then group by 3 variables - station_id
time
and date
using data.table
?
For example, the following output would be expected from the data provided
> output
station_id bikes_taken hour date
1 3 1 1 2018-01-15
2 3 7 2 2018-01-15
3 4 4 1 2018-01-15
4 4 1 2 2018-01-15
5 5 0 1 2018-01-15
6 5 2 2 2018-01-15
(full dataset below)
> dput(dat)
structure(list(station_id = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L), bikes_available = c(2, 1, 1, 4, 4, 1,
1, 1, 7, 3, 4, 0, 0, 0, 0, 0, 1, 1, 1, 0, 5, 5, 5, 5, 4, 4, 4,
4, 3, 3), time = structure(c(1516010282, 1516010341, 1516010403,
1516010462, 1516010522, 1516010582, 1516010642, 1516010702, 1516010762,
1516010822, 1516010282, 1516010341, 1516010403, 1516010462, 1516010522,
1516010582, 1516010642, 1516010702, 1516010762, 1516010822, 1516010282,
1516010341, 1516010403, 1516010462, 1516010522, 1516010582, 1516010642,
1516010702, 1516010762, 1516010822), class = c("POSIXct", "POSIXt"
), tzone = ""), date = structure(c(17546, 17546, 17546, 17546,
17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546,
17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546,
17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546), class = "Date"),
hour = c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L)), .Names = c("station_id", "bikes_available",
"time", "date", "hour"), row.names = c(NA, -30L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x102800778>)
Upvotes: 3
Views: 666
Reputation: 83215
Another take with data.table
:
dat[, .(bikes_taken = diff(bikes_available)), by = .(station_id, date, hour)
][bikes_taken <= 0, .(bikes_taken = sum(bikes_taken*-1)), by = .(station_id, date, hour)]
which gives:
station_id date hour bikes_taken 1: 3 2018-01-15 1 1 2: 3 2018-01-15 2 7 3: 4 2018-01-15 1 4 4: 4 2018-01-15 2 1 5: 5 2018-01-15 1 0 6: 5 2018-01-15 2 2
Upvotes: 1
Reputation: 12713
library("data.table")
setDT(dat)
dat[,
j = .(bikes_taken = bikes_available - shift( x = bikes_available, n = 1, type = 'lead')),
by = .(station_id, date, hour)][ i = bikes_taken >= 0,
j = .(bikes_taken = sum(bikes_taken)),
by = .(station_id, date, hour)]
# station_id date hour bikes_taken
# 1: 3 2018-01-15 1 1
# 2: 3 2018-01-15 2 7
# 3: 4 2018-01-15 1 4
# 4: 4 2018-01-15 2 1
# 5: 5 2018-01-15 1 0
# 6: 5 2018-01-15 2 2
Upvotes: 2
Reputation: 2496
using tidyverse
functions, you can try :
df %>%
group_by(station_id, date, hour) %>%
mutate( b_taken = bikes_available - lead(bikes_available)) %>%
filter(b_taken >= 0) %>%
mutate(b_taken = sum(b_taken)) %>%
select(b_taken) %>%
unique()
which gives:
station_id date hour b_taken
<int> <date> <int> <dbl>
1 3 2018-01-15 1 1
2 3 2018-01-15 2 7
3 4 2018-01-15 1 4
4 4 2018-01-15 2 1
5 5 2018-01-15 1 0
6 5 2018-01-15 2 2
Upvotes: 1