Reputation: 11
I've come up against a wall in trying to resolve this and hope somebody can help. I'm trying to implement a way to filter this dataset which reflects bike station occupancy data that is time stamped.
ID Time Bike.Availability
1 2 01/04/2020 04:31:16 11
2 2 01/04/2020 04:40:07 11
3 2 01/04/2020 04:50:15 10
4 2 01/04/2020 04:57:10 10
5 2 01/04/2020 05:07:19 9
6 2 01/04/2020 05:19:38 10
7 2 01/04/2020 05:29:47 10
8 2 01/04/2020 06:43:54 11
I want to remove the rows where there is no change in Bike.Availability and only keep the first instance. I would like the resulting dataset to look as follows:
ID Time Bike.Availability
1 2 01/04/2020 04:31:16 11
2 2 01/04/2020 04:50:15 10
3 2 01/04/2020 05:07:19 9
4 2 01/04/2020 05:19:38 10
5 2 01/04/2020 06:43:54 11
I've converted the timestamp:
bike_data$Time <- as.POSIXct(bike_data$Time,format="%Y-%m-%d %H:%M:%S")
And I've tried different variations of:
library(dplyr)
bike_data %>%
group_by(Time) %>%
arrange(Bike.Availability) %>%
top_n(1)
Any help or feedback would be greatly appreciated.
Upvotes: 1
Views: 39
Reputation: 79144
A dplyr
solution alone. Checking if row above and below are same ifelse
. Then NA
to 0
and then filter.
library(dplyr)
bike_data %>%
mutate(same = ifelse(Bike.Availability == lag(Bike.Availability), 1, 0)) %>%
mutate(same = ifelse(is.na(same), 0, same)) %>%
filter(same=="NA" | same==0) %>%
select(-same)
Output:
ID Time Bike.Availability
1 2 01/04/2020 04:31:16 11
3 2 01/04/2020 04:50:15 10
5 2 01/04/2020 05:07:19 9
6 2 01/04/2020 05:19:38 10
8 2 01/04/2020 06:43:54 11
Upvotes: 2
Reputation: 887501
We group by the 'ID' and run-length-id of 'Bike.Availability' i.e. it creates a grouping index based on the similarity of adjacent elements of 'Bike.Availability', then slice
the first row with slice_head
specifying n = 1
library(dplyr)
library(data.table)
bike_data %>%
group_by(ID, grp = rleid(Bike.Availability)) %>%
slice_head(n = 1) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 5 x 3
# ID Time Bike.Availability
# <int> <chr> <int>
#1 2 01/04/2020 04:31:16 11
#2 2 01/04/2020 04:50:15 10
#3 2 01/04/2020 05:07:19 9
#4 2 01/04/2020 05:19:38 10
#5 2 01/04/2020 06:43:54 11
Grouping by 'Time' column would create groups with single observation per group (based on the values showed in 'Time'), thererefore top_n(1)
returns the original dataset instead of subsetting
bike_data <- structure(list(ID = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
Time = c("01/04/2020 04:31:16",
"01/04/2020 04:40:07", "01/04/2020 04:50:15", "01/04/2020 04:57:10",
"01/04/2020 05:07:19", "01/04/2020 05:19:38", "01/04/2020 05:29:47",
"01/04/2020 06:43:54"), Bike.Availability = c(11L, 11L, 10L,
10L, 9L, 10L, 10L, 11L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
Upvotes: 3