jachty
jachty

Reputation: 21

Assign ID by matching conditions over different columns

Hi Guys I would like to assign an ID to the following dataframe. The Bus has to be the same if the arrival station matches departure station and if arrival time matches with the departure time. Does anyone has an idea how to solve this issue? Thanks in advance!

Example

I would like to have the following:

Expected outcome

Upvotes: 0

Views: 52

Answers (1)

drf
drf

Reputation: 8699

One approach is to create a graph object where the composite key (station, timestamp) represents a graph vertex, and each route represents an edge. From this graph, each connected component then represents a unique route, so in your example you have the two components:

 Component 1: (Station1 10:10) -> (Station2 10:15) -> (Station3 10:18) -> (Station4 10:20)
 Component 2: (Station10 10:12) -> (Station 10:25)

Using the igraph and Tidyverse packages (here dplyr, magrittr, and tibble), this approach could be implemented like so:

# df is source data.  create a composite key for 
# arrival and departure by concatenating station 
# name and timestamp
df %<>% mutate(arrkey = paste0(From, Departure),
                 depkey = paste0(To, Arrival));

# create graph, identify clusters, and convert clusters to data frame
components <- graph_from_data_frame(df %>% select(arrkey, depkey)) %>%
   components() %>%
   `$`('membership') %>% 
   as.data.frame() %>%
   tibble::rownames_to_column() %T>%
   {names(.) <- c('vertexkey','component')}

# join components with original data frame to produce output
df %>% inner_join(components, by=c('arrkey'='vertexkey')) %>%
    select(ID=component, everything()) %>%
    select(-arrkey, -depkey) %>%
    arrange(ID, Departure)

This produces the desired output:

  ID      From        To Departure Arrival
1  1  Station1  Station2     10:10   10:15
2  1  Station2  Station3     10:15   10:18
3  1  Station3  Station4     10:18   10:20
4  2 Station10 Station15     10:12   10:25

Note: I used the following code to produce df (removing the date from departure/arrival for simplicity):

df <- data.frame(
    From=c('Station1', 'Station10', 'Station2', 'Station3'),
    To=c('Station2', 'Station15', 'Station3', 'Station4'),
    Departure = c('10:10','10:12','10:15','10:18'),
    Arrival = c('10:15','10:25','10:18','10:20'));

Upvotes: 1

Related Questions