user9630195

Reputation: 91

R: efficient dataframe split-apply(concatenate)

I am hoping to get some expert advice on an efficient way to remove duplicate rows of a dataframe based on one column (columnA), while at the same time creating a new variable that concatenates the entries of another column (columnB) for the rows where columnA had duplicates. Below I provide a toy dataframe:

my_df <- data.frame('DateTime' = c('2017/05/05 08:30:00', '2017/05/05 08:30:00',
                                   '2017/05/05 08:30:00', '2017/12/08 08:30:00',
                                   '2018/01/15 18:50:00', '2017/12/20 21:46:00',
                                   '2017/11/12 18:50:00', '2017/11/03 08:30:00',
                                   '2017/11/03 08:30:00', '2017/12/03 08:30:00'),
                    'Event' = c('A', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'B', 'A'),
                    'Var1' = rnorm(10),
                    stringsAsFactors = FALSE)

In this dataframe the DateTime column is a character column, and 2017/05/05 08:30:00 appears 3 times while 2017/11/03 08:30:00 appears twice. My goal is to collapse rows where there are duplicates of DateTime and to have a new column that concatenates the Event entries. So the new column, let's say AllEvents, should have the entry A+B+C for the 2017/05/05 08:30:00 row, and for the 2017/11/03 08:30:00 row it should have the value A+B. Finally, for all other rows, AllEvents should match the Event column.

My attempt seems very clunky.

I first extract unique values of DateTime where there is more than one entry:

require(dplyr)
duped_datetime <- unique(my_df[duplicated(my_df$DateTime), 'DateTime'])

I then subset my_df to extract the entries where there are duplicates:

subset_df <- my_df[my_df$DateTime %in% duped_datetime,]

Next I create a function that concatenates a vector:

my_concat <- function(x){
  concat_str <- subset_df %>%
    filter(DateTime == x) %>%
    select(Event) %>%
    unlist() %>%
    paste(collapse = "+")
  return(concat_str)
}

Next I loop through the duplicated dates and apply the my_concat function:

named_vc <- sapply(duped_datetime, FUN = my_concat)

The results are combined into a new dataframe:

new_df <- data.frame('DateTime' = duped_datetime,
                     'AllEvents' = unname(named_vc), 
                      stringsAsFactors = FALSE)

The results are merged, and final_df is cleaned up to keep only the rows and columns I need:

final_df <- left_join(my_df, new_df, by = 'DateTime')  %>% 
            mutate(AllEvents = ifelse(is.na(AllEvents), Event, AllEvents)) 
final_df <- final_df[!duplicated(final_df$DateTime),]
final_df['Event'] <- NULL  

I get the result that I need but as you can see the code is hideous. The whole thing can be done in 4 lines in Python with groupby, apply, and lambda functions but it is not immediately obvious to me how to get the same task done cleanly in R.
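For reference, the pandas approach alluded to above might look like the sketch below (the DateTime and Event values are taken from the toy dataframe; the Var1 values are fixed placeholders here rather than rnorm draws, since rnorm output varies):

```python
import pandas as pd

my_df = pd.DataFrame({
    'DateTime': ['2017/05/05 08:30:00', '2017/05/05 08:30:00',
                 '2017/05/05 08:30:00', '2017/12/08 08:30:00',
                 '2018/01/15 18:50:00', '2017/12/20 21:46:00',
                 '2017/11/12 18:50:00', '2017/11/03 08:30:00',
                 '2017/11/03 08:30:00', '2017/12/03 08:30:00'],
    'Event': ['A', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'B', 'A'],
    'Var1': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
})

# Group on DateTime (sort=False keeps the original row order),
# keep the first Var1 per group, and join the Event entries with "+"
out = (my_df.groupby('DateTime', sort=False)
            .agg(Var1=('Var1', 'first'),
                 AllEvents=('Event', lambda s: '+'.join(s)))
            .reset_index())
print(out)
```

This produces one row per DateTime with Var1 from the first occurrence and AllEvents such as A+B+C for 2017/05/05 08:30:00, matching the desired output above.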

    DateTime            Var1      AllEvents
2017/05/05 08:30:00   -0.8350209    A+B+C
2017/12/08 08:30:00    1.1534819    A
2018/01/15 18:50:00   -0.3501990    A
2017/12/20 21:46:00   -0.6664841    B
2017/11/12 18:50:00    1.7142981    C
2017/11/03 08:30:00   -2.0133559    A+B
2017/12/03 08:30:00   -0.6150040    A

Thanks to anyone who has the patience to read this.

Upvotes: 1

Views: 102

Answers (2)

IceCreamToucan

Reputation: 28685

Just for the sake of variety, here's the same thing in data.table:

library(data.table)
setDT(my_df)

my_df[, .(Var1  = first(Var1)
        , Event = paste0(Event, collapse = "+"))
      , by = DateTime]

Gives

              DateTime       Var1 Event
1: 2017/05/05 08:30:00  0.2366874 A+B+C
2: 2017/12/08 08:30:00  0.3699069     A
3: 2018/01/15 18:50:00 -0.2420663     A
4: 2017/12/20 21:46:00 -1.4720633     B
5: 2017/11/12 18:50:00 -0.5961595     C
6: 2017/11/03 08:30:00 -1.1467001   A+B
7: 2017/12/03 08:30:00 -0.6135086     A

Note that the order of my_df is preserved.

Upvotes: 3

Marius

Reputation: 60070

This can be done straightforwardly in dplyr; group_by works fine with the character DateTime values:

my_df %>%
    group_by(DateTime) %>%
    summarise(Var1 = first(Var1),
              Event = paste0(Event, collapse = "+"))

Output:

# A tibble: 7 x 3
  DateTime              Var1 Event
  <chr>                <dbl> <chr>
1 2017/05/05 08:30:00  0.159 A+B+C
2 2017/11/03 08:30:00 -0.610 A+B  
3 2017/11/12 18:50:00  0.465 C    
4 2017/12/03 08:30:00 -1.89  A    
5 2017/12/08 08:30:00  0.793 A    
6 2017/12/20 21:46:00  0.755 B    
7 2018/01/15 18:50:00  0.511 A  

Upvotes: 5
