LoF10

Reputation: 2127

How to code a factor variable when a value lies between two other factors either with a new column or by adding levels?

I have the following df:

    id    time              x     y      pickup_dropoff
    1    2/1/2013 12:23    73    40       pickup
    1    2/1/2013 12:25    73    40.2     ping
    1    2/1/2013 12:27    73    40.5     ping
    1    2/1/2013 12:34    73    41       dropoff
    1    2/1/2013 12:35    73    41.4     ping
    1    1/1/2013 12:45   73.6   41       pickup
    1    1/1/2013 12:57   73.5   41       dropoff
    2    1/2/2013 12:54   73.6   42       ping   
    2    1/2/2013 13:00   73.45  42       pickup
    2    1/2/2013 14:00   73     42       dropoff
    2    1/2/2013 14:50   73.11  41       pickup
    2    1/2/2013 15:30   73     44       dropoff
    2    1/2/2013 16:00   73.1   41       pickup
    2    1/2/2013 18:00    74    42       dropoff

Thanks to the help I received in this post (Reshape Data partially from Wide to Long in R), I was able to reshape the data to resemble the above. I'm now looking to recode some of the factor values to show when a vehicle is in use and when it is cruising without being in use. This new variable would make the following assumptions:

  1. If a ping is between a pickup and a dropoff, the vehicle is in use.
  2. If a ping is between a dropoff and a pickup, it's out of use.

I'd like the output to look like the following:

        id    time              x     y      pickup_dropoff     status
         1    2/1/2013 12:23    73    40       pickup           pickup
         1    2/1/2013 12:25    73    40.2     ping              inuse      
         1    2/1/2013 12:27    73    40.5     ping              inuse
         1    2/1/2013 12:34    73    41       dropoff           dropoff
         1    2/1/2013 12:35    73    41.4     ping              nouse
         1    1/1/2013 12:45   73.6   41       pickup            pickup
         1    1/1/2013 12:57   73.5   41       dropoff           dropoff
         2    1/2/2013 12:54   73.6   42       ping              unknown
         2    1/2/2013 13:00   73.45  42       pickup            pickup 
         2    1/2/2013 14:00   73     42       dropoff           dropoff
         2    1/2/2013 14:50   73.11  41       pickup            pickup
         2    1/2/2013 15:30   73     44       dropoff           dropoff
         2    1/2/2013 16:00   73.1   41       pickup            pickup 
         2    1/2/2013 18:00    74    42       dropoff           dropoff 

I currently have pickup_dropoff coded as a factor with 3 levels.

One solution I am playing with is adding a column with the factor levels coded as 1, 2, 3, using as.numeric to convert them to numbers, and then writing a couple of if statements like the following:

            df$status = ifelse(df$pickup_dropoff LAYS BETWEEN 3
            and 1, df$pickup_dropoff == "inuse", df$pickup_dropoff)

I may be overthinking this, but I'm not sure whether there is a way to say "in between" in R. I also have to account for another dimension, "id", since I don't want a ping that falls between events of two different ids to be considered in use. In that case it should be "unknown", because the data I am working with is incomplete.

Any help is appreciated. Thanks!

Upvotes: 2

Views: 165

Answers (1)

Gregor Thomas

Reputation: 145965

I think this will work:

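    library(dplyr)
    df %>% mutate(
        # state that follows each event; pings stay NA for now
        status = ifelse(pickup_dropoff == "pickup", "inuse",
            ifelse(pickup_dropoff == "dropoff", "nouse", NA))
    ) %>%
    group_by(id) %>%
    mutate(
        # carry the last state forward within each id
        status = zoo::na.locf(status, na.rm = FALSE),
        # as.character() because ifelse() strips factor attributes, leaving integer codes
        status = ifelse(pickup_dropoff %in% c("pickup", "dropoff"),
                        as.character(pickup_dropoff), status),
        status = ifelse(is.na(status), "unknown", status)
    )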

First we fill in the values that the new column should take after each pickup and dropoff, leaving everything else as NA. Then we fill in the missing values using zoo::na.locf (grouped by id). Lastly, we reset the values on the pickup and dropoff rows themselves to what we actually want.
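
To make the three steps concrete, here is a rough base-R walk-through on a toy subset of the question's data (the first five rows of id 1 and the first three of id 2). The `locf` helper is a hypothetical stand-in I wrote so the snippet runs without zoo; it is meant to behave like `zoo::na.locf(x, na.rm = FALSE)` on a plain vector, and the `event` variable is just an illustrative name.

```r
# Toy data: first five rows of id 1 and first three rows of id 2.
df <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 2),
  pickup_dropoff = factor(c("pickup", "ping", "ping", "dropoff", "ping",
                            "ping", "pickup", "dropoff"))
)

# Hypothetical stand-in for zoo::na.locf(x, na.rm = FALSE):
# carry the last non-NA value forward, leaving leading NAs in place.
locf <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]    # position 1 (NA) until a value appears
}

# Work with plain strings so ifelse() does not return factor codes.
event <- as.character(df$pickup_dropoff)

# Step 1: seed the state that follows each event; pings start as NA.
status <- ifelse(event == "pickup", "inuse",
                 ifelse(event == "dropoff", "nouse", NA))

# Step 2: fill the NAs forward, separately within each id.
status <- ave(status, df$id, FUN = locf)

# Step 3: restore the event labels and mark leftover NAs as unknown.
status <- ifelse(event %in% c("pickup", "dropoff"), event, status)
status[is.na(status)] <- "unknown"

df$status <- status
print(df)
```

The ping at the start of id 2 has nothing before it within its group, so it stays NA after the fill and ends up as "unknown", which is the behaviour asked for in the question.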

Note that status comes out as a character vector; you can of course add a factor conversion at the end.
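
A minimal sketch of that final conversion, assuming you want the levels in a fixed order (the order below is my assumption, not something the question specifies; without `levels`, `factor()` sorts them alphabetically):

```r
# Example status values as they would come out of the steps above.
status <- c("pickup", "inuse", "inuse", "dropoff", "nouse", "unknown")

# Assumed level order; adjust to whatever ordering suits your analysis.
status_levels <- c("pickup", "inuse", "dropoff", "nouse", "unknown")

status_f <- factor(status, levels = status_levels)
print(levels(status_f))
print(table(status_f))
```

In the dplyr version this would be one more `mutate()` step at the end of the pipeline; in the base version, a final `df$status = factor(df$status, levels = ...)`.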


Using plyr or base instead of dplyr:

    # state that follows each event; pings stay NA for now
    df$status = with(df, ifelse(pickup_dropoff == "pickup", "inuse",
                    ifelse(pickup_dropoff == "dropoff", "nouse", NA)))

    ## pick one
    # base
    df$status = ave(df$status, df$id, FUN = function(x) zoo::na.locf(x, na.rm = FALSE))
    # plyr
    df = plyr::ddply(df, "id", plyr::mutate, status = zoo::na.locf(status, na.rm = FALSE))

    # as.character() because ifelse() strips factor attributes, leaving integer codes
    df$status = with(df, ifelse(pickup_dropoff %in% c("pickup", "dropoff"),
                                as.character(pickup_dropoff), status))
    df$status = with(df, ifelse(is.na(status), "unknown", status))

Upvotes: 2
