user9630195

Reputation: 91

R: efficient dataframe split-apply(concatenate)

I am hoping to get some expert advice on an efficient way to remove duplicate rows of a dataframe based on one column (columnA), while at the same time creating a new variable that concatenates the entries of another column (columnB) for the rows where columnA had duplicates. Below I provide a toy dataframe:

my_df <- data.frame('DateTime' = c('2017/05/05 08:30:00', '2017/05/05 08:30:00',
                                   '2017/05/05 08:30:00', '2017/12/08 08:30:00',
                                   '2018/01/15 18:50:00', '2017/12/20 21:46:00',
                                   '2017/11/12 18:50:00', '2017/11/03 08:30:00',
                                   '2017/11/03 08:30:00', '2017/12/03 08:30:00'),
                    'Event' = c('A', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'B', 'A'),
                    'Var1' = rnorm(10),
                    stringsAsFactors = FALSE)

In this dataframe the DateTime column is a character column, and 2017/05/05 08:30:00 appears 3 times while 2017/11/03 08:30:00 appears twice. My goal is to collapse rows where there are duplicates of DateTime and to have a new column that concatenates the Event entries. So the new column, let's say AllEvents, should have the entry A+B+C for the 2017/05/05 08:30:00 row, and for the 2017/11/03 08:30:00 row it should have the value A+B. Finally, for all other rows, AllEvents should match the Event column.

My attempt seems very clunky.

I first extract unique values of DateTime where there is more than one entry:

require(dplyr)
duped_datetime <- unique(my_df[duplicated(my_df$DateTime), 'DateTime'])

I then subset my_df to extract the entries where there are duplicates:

subset_df <- my_df[my_df$DateTime %in% duped_datetime,]

Next I create a function that concatenates a vector:

my_concat <- function(x){
  concat_str <- subset_df %>%
    filter(DateTime == x) %>%
    select(Event) %>%
    unlist() %>%
    paste(collapse = "+")
  return(concat_str)
}

Next I loop through the duplicated dates and apply the my_concat function:

named_vc <- sapply(duped_datetime, FUN = my_concat)

The results are combined into a new dataframe:

new_df <- data.frame('DateTime' = duped_datetime,
                     'AllEvents' = unname(named_vc), 
                      stringsAsFactors = FALSE)

The results are merged, and final_df is cleaned up to keep only the rows and columns I need:

final_df <- left_join(my_df, new_df, by = 'DateTime')  %>% 
            mutate(AllEvents = ifelse(is.na(AllEvents), Event, AllEvents)) 
final_df <- final_df[!duplicated(final_df$DateTime),]
final_df['Event'] <- NULL  

I get the result that I need but as you can see the code is hideous. The whole thing can be done in 4 lines in Python with groupby, apply, and lambda functions but it is not immediately obvious to me how to get the same task done cleanly in R.
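For reference, the pandas approach alluded to above might look like the sketch below (the DateTime and Event values are taken from the toy dataframe; the Var1 values are fixed placeholders here rather than rnorm draws, since rnorm output varies):

```python
import pandas as pd

my_df = pd.DataFrame({
    'DateTime': ['2017/05/05 08:30:00', '2017/05/05 08:30:00',
                 '2017/05/05 08:30:00', '2017/12/08 08:30:00',
                 '2018/01/15 18:50:00', '2017/12/20 21:46:00',
                 '2017/11/12 18:50:00', '2017/11/03 08:30:00',
                 '2017/11/03 08:30:00', '2017/12/03 08:30:00'],
    'Event': ['A', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'B', 'A'],
    'Var1': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
})

# Group on DateTime (sort=False keeps the original row order),
# keep the first Var1 per group, and join the Event entries with "+"
out = (my_df.groupby('DateTime', sort=False)
            .agg(Var1=('Var1', 'first'),
                 AllEvents=('Event', lambda s: '+'.join(s)))
            .reset_index())
print(out)
```

This produces one row per DateTime with Var1 from the first occurrence and AllEvents such as A+B+C for 2017/05/05 08:30:00, matching the desired output above.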

    DateTime            Var1      AllEvents
2017/05/05 08:30:00   -0.8350209    A+B+C
2017/12/08 08:30:00    1.1534819    A
2018/01/15 18:50:00   -0.3501990    A
2017/12/20 21:46:00   -0.6664841    B
2017/11/12 18:50:00    1.7142981    C
2017/11/03 08:30:00   -2.0133559    A+B
2017/12/03 08:30:00   -0.6150040    A

Thanks to anyone who has the patience to read this.

Upvotes: 1

Views: 102

Answers (2)

IceCreamToucan

Reputation: 28685

Just for the sake of variety, here's the same thing in data.table:

library(data.table)
setDT(my_df)

my_df[, .(Var1  = first(Var1)
        , Event = paste0(Event, collapse = "+"))
      , by = DateTime]

Gives

              DateTime       Var1 Event
1: 2017/05/05 08:30:00  0.2366874 A+B+C
2: 2017/12/08 08:30:00  0.3699069     A
3: 2018/01/15 18:50:00 -0.2420663     A
4: 2017/12/20 21:46:00 -1.4720633     B
5: 2017/11/12 18:50:00 -0.5961595     C
6: 2017/11/03 08:30:00 -1.1467001   A+B
7: 2017/12/03 08:30:00 -0.6135086     A

Note that the order of my_df is preserved.

Upvotes: 3

Marius

Reputation: 60070

This can be done straightforwardly in dplyr; group_by works fine with the character DateTime values:

my_df %>%
    group_by(DateTime) %>%
    summarise(Var1 = first(Var1),
              Event = paste0(Event, collapse = "+"))

Output:

# A tibble: 7 x 3
  DateTime              Var1 Event
  <chr>                <dbl> <chr>
1 2017/05/05 08:30:00  0.159 A+B+C
2 2017/11/03 08:30:00 -0.610 A+B  
3 2017/11/12 18:50:00  0.465 C    
4 2017/12/03 08:30:00 -1.89  A    
5 2017/12/08 08:30:00  0.793 A    
6 2017/12/20 21:46:00  0.755 B    
7 2018/01/15 18:50:00  0.511 A  

Upvotes: 5
