Reputation: 37
I have a dataframe with time and a transport mode. There are some values, that are classified in a walking part as "unknown" or false classified as an other transport mode, for example "car"
Here is the structure
travel <- structure(list(time = structure(1:19, .Label = c("16:20:20",
"16:20:21", "16:20:22", "16:20:23", "16:20:24", "16:20:25", "16:20:26",
"16:20:27", "16:20:28", "16:20:29", "16:20:30", "16:20:31", "16:20:32",
"16:20:33", "16:20:34", "16:20:35", "16:20:36", "16:20:37", "16:20:38"
), class = "factor"), mode = structure(c(3L, 3L, 3L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L), .Label = c("car",
"bus stop", "walk"), class = "factor")), class = "data.frame", row.names = c(NA,
-19L))
Here you can see the dataframe
time mode
1 16:20:20 walk
2 16:20:21 walk
3 16:20:22 walk
4 16:20:23 bus stop
5 16:20:24 bus stop
6 16:20:25 bus stop
7 16:20:26 bus stop
8 16:20:27 bus stop
9 16:20:28 walk
10 16:20:29 walk
11 16:20:30 walk
12 16:20:31 walk
13 16:20:32 walk
14 16:20:33 walk
15 16:20:34 car
16 16:20:35 car
17 16:20:36 car
18 16:20:37 car
19 16:20:38 car
I'd like to replace the values that are not correct. In the data example, the observations 4,5 and 7 have to be classified as "walk".
I think of a mutate function, that looks at the pre and post values ob the observations. if before and after the "unknown" values the value is the same mode (for examnple: walk) then the unknown values are classified as "walk". Also the observation 7 has somehow be classified as "walk".
thx for your help!
Upvotes: 1
Views: 161
Reputation: 26218
Changed your sample data a bit.
travel <- structure(list(time = structure(1:19, .Label = c("16:20:20",
"16:20:21", "16:20:22", "16:20:23", "16:20:24", "16:20:25", "16:20:26",
"16:20:27", "16:20:28", "16:20:29", "16:20:30", "16:20:31", "16:20:32",
"16:20:33", "16:20:34", "16:20:35", "16:20:36", "16:20:37", "16:20:38"
), class = "factor"), mode = structure(c(3L, 3L, 3L, 2L, 2L,
3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("car",
"unknown", "walk"), class = "factor")), class = "data.frame", row.names = c(NA,
-19L))
travel
time mode
1 16:20:20 walk
2 16:20:21 walk
3 16:20:22 walk
4 16:20:23 unknown
5 16:20:24 unknown
6 16:20:25 walk
7 16:20:26 car
8 16:20:27 walk
9 16:20:28 walk
10 16:20:29 walk
11 16:20:30 walk
12 16:20:31 walk
13 16:20:32 walk
14 16:20:33 unknown
15 16:20:34 car
16 16:20:35 car
17 16:20:36 car
18 16:20:37 car
19 16:20:38 car
Now use this syntax, for a maximum of two discrepancies
library(data.table)
travel %>% group_by(d = rleid(mode)) %>%
mutate(d2 = n()) %>% ungroup() %>%
mutate(mode = case_when( d2 == 1 & lag(mode) == lead(mode) ~ lag(mode),
d2 == 2 & first(mode[d]) == first(mode[d-1]) ~ first(mode[d-1]),
TRUE ~ mode)) %>%
select(-d, -d2)
# A tibble: 19 x 2
time mode
<fct> <fct>
1 16:20:20 walk
2 16:20:21 walk
3 16:20:22 walk
4 16:20:23 walk
5 16:20:24 walk
6 16:20:25 walk
7 16:20:26 walk
8 16:20:27 walk
9 16:20:28 walk
10 16:20:29 walk
11 16:20:30 walk
12 16:20:31 walk
13 16:20:32 walk
14 16:20:33 unknown
15 16:20:34 car
16 16:20:35 car
17 16:20:36 car
18 16:20:37 car
19 16:20:38 car
Upvotes: 2
Reputation: 8506
The first step could be to change the mode
"unknown" to NA and use zoo::na.locf
to fill in with the preceding non-NA values, if that is what you want.
Secondly, you could for example define a single occurrence of another mode
as something you would like to replace, again using the preceding known value. You can use rle
to find such cases.
df <- read.table(text=" time mode
1 16:20:20 walk
2 16:20:21 walk
3 16:20:22 walk
4 16:20:23 unknown
5 16:20:24 unknown
6 16:20:25 walk
7 16:20:26 car
8 16:20:27 walk
9 16:20:28 walk
10 16:20:29 walk
11 16:20:30 walk
12 16:20:31 walk
13 16:20:32 walk
14 16:20:33 walk
15 16:20:34 car
16 16:20:35 car
17 16:20:36 car
18 16:20:37 car
19 16:20:38 car
", header=T)
library(zoo)
df$mode[df$mode=="unknown"] <- NA
df$mode <- na.locf(df$mode)
rdf <- rle(df$mode)
rdf$values[which(rdf$lengths==1)] <- NA
df$mode <- na.locf(inverse.rle(rdf))
df
#> time mode
#> 1 16:20:20 walk
#> 2 16:20:21 walk
#> 3 16:20:22 walk
#> 4 16:20:23 walk
#> 5 16:20:24 walk
#> 6 16:20:25 walk
#> 7 16:20:26 walk
#> 8 16:20:27 walk
#> 9 16:20:28 walk
#> 10 16:20:29 walk
#> 11 16:20:30 walk
#> 12 16:20:31 walk
#> 13 16:20:32 walk
#> 14 16:20:33 walk
#> 15 16:20:34 car
#> 16 16:20:35 car
#> 17 16:20:36 car
#> 18 16:20:37 car
#> 19 16:20:38 car
Created on 2021-03-23 by the reprex package (v1.0.0)
Or, more succinctly, combining the two conditions upfront:
library(zoo)
rdf <- rle(df$mode)
rdf$values[with(rdf, lengths==1 | values=="unknown")] <- NA
df$mode <- na.locf(inverse.rle(rdf))
Upvotes: 1