Reputation: 91
I am hoping to get some expert advice on an efficient way to remove duplicated rows of a dataframe based on one column (columnA) while, at the same time, creating a new variable that concatenates the entries of another column (columnB) for those duplicates. Below I provide a toy dataframe:
my_df <- data.frame('DateTime' = c('2017/05/05 08:30:00', '2017/05/05 08:30:00',
'2017/05/05 08:30:00', '2017/12/08 08:30:00',
'2018/01/15 18:50:00', '2017/12/20 21:46:00',
'2017/11/12 18:50:00', '2017/11/03 08:30:00',
'2017/11/03 08:30:00', '2017/12/03 08:30:00'),
'Event' = c('A', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'B', 'A'),
'Var1' = rnorm(10),
stringsAsFactors = FALSE)
In this dataframe the DateTime column is a character column, and 2017/05/05 08:30:00 appears 3 times while 2017/11/03 08:30:00 appears twice. My goal is to collapse the rows where DateTime is duplicated and to add a new column that concatenates the Event entries. So the new column, let's say AllEvents, should have the entry A+B+C for the 2017/05/05 08:30:00 row, and A+B for the 2017/11/03 08:30:00 row. Finally, for all other rows, AllEvents should simply match the Event column.
My attempt seems very clunky.
I first extract the unique values of DateTime that appear more than once:
require(dplyr)
duped_datetime <- unique(my_df[duplicated(my_df$DateTime), 'DateTime'])
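With the toy data above this picks out the two repeated timestamps:
duped_datetime
# [1] "2017/05/05 08:30:00" "2017/11/03 08:30:00"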
I then subset my_df to keep only the entries that have duplicates:
subset_df <- my_df[my_df$DateTime %in% duped_datetime,]
Next I create a function that concatenates the Event entries for a given DateTime:
my_concat <- function(x){
  # collapse all Event values for the given DateTime into a single string
  concat_str <- subset_df %>% filter(DateTime == x) %>%
    select(Event) %>%
    unlist() %>%
    paste(collapse = "+")
  return(concat_str)
}
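For example, for the triplicated timestamp the function returns the collapsed events:
my_concat('2017/05/05 08:30:00')
# [1] "A+B+C"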
Next I loop through the duplicated dates and apply the my_concat function:
named_vc <- sapply(duped_datetime, FUN = my_concat)
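This gives a character vector named by the duplicated timestamps:
named_vc
# 2017/05/05 08:30:00 2017/11/03 08:30:00
#             "A+B+C"               "A+B"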
The results are combined into a new dataframe:
new_df <- data.frame('DateTime' = duped_datetime,
'AllEvents' = unname(named_vc),
stringsAsFactors = FALSE)
The results are merged back into my_df, and final_df is cleaned up to keep only the rows and columns I need:
final_df <- left_join(my_df, new_df, by = 'DateTime') %>%
mutate(AllEvents = ifelse(is.na(AllEvents), Event, AllEvents))
final_df <- final_df[!duplicated(final_df$DateTime),]
final_df['Event'] <- NULL
I get the result that I need, but as you can see the code is hideous. The whole thing can be done in 4 lines in Python with groupby, apply, and lambda functions, yet it is not immediately obvious to me how to get the same task done cleanly in R (a base R sketch of that pattern follows the output below).
DateTime Var1 AllEvents
2017/05/05 08:30:00 -0.8350209 A+B+C
2017/12/08 08:30:00 1.1534819 A
2018/01/15 18:50:00 -0.3501990 A
2017/12/20 21:46:00 -0.6664841 B
2017/11/12 18:50:00 1.7142981 C
2017/11/03 08:30:00 -2.0133559 A+B
2017/12/03 08:30:00 -0.6150040 A
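For reference, here is a rough base R sketch of the same groupby-style collapse (just a sketch for comparison; note that merge() re-sorts the rows by DateTime rather than preserving the original order):
events <- aggregate(Event ~ DateTime, data = my_df, FUN = paste, collapse = "+")
vars <- aggregate(Var1 ~ DateTime, data = my_df, FUN = function(x) x[1])
merge(vars, events, by = 'DateTime')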
Thanks to anyone who has the patience to read this.
Upvotes: 1
Views: 102
Reputation: 28685
Just for the sake of variety, here's the same thing in data.table:
library(data.table)
setDT(my_df)
my_df[, .(Var1 = first(Var1)
, Event = paste0(Event, collapse = "+"))
, by = DateTime]
Gives
DateTime Var1 Event
1: 2017/05/05 08:30:00 0.2366874 A+B+C
2: 2017/12/08 08:30:00 0.3699069 A
3: 2018/01/15 18:50:00 -0.2420663 A
4: 2017/12/20 21:46:00 -1.4720633 B
5: 2017/11/12 18:50:00 -0.5961595 C
6: 2017/11/03 08:30:00 -1.1467001 A+B
7: 2017/12/03 08:30:00 -0.6135086 A
Note that the order of my_df is preserved.
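If sorted output is preferred instead, swapping by for keyby orders the result by the grouping column (and sets it as the table's key):
my_df[, .(Var1 = first(Var1)
, Event = paste0(Event, collapse = "+"))
, keyby = DateTime]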
Upvotes: 3
Reputation: 60070
This can be done straightforwardly in dplyr; group_by works fine with the DateTime values:
my_df %>%
group_by(DateTime) %>%
summarise(Var1 = first(Var1),
Event = paste0(Event, collapse = "+"))
Output:
# A tibble: 7 x 3
DateTime Var1 Event
<chr> <dbl> <chr>
1 2017/05/05 08:30:00 0.159 A+B+C
2 2017/11/03 08:30:00 -0.610 A+B
3 2017/11/12 18:50:00 0.465 C
4 2017/12/03 08:30:00 -1.89 A
5 2017/12/08 08:30:00 0.793 A
6 2017/12/20 21:46:00 0.755 B
7 2018/01/15 18:50:00 0.511 A
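Note that summarise() returns the groups sorted by the grouping key, which is why the output above is ordered by DateTime. If the original row order matters, one option (a sketch using base match() to restore first-appearance order) is:
my_df %>%
  group_by(DateTime) %>%
  summarise(Var1 = first(Var1),
            Event = paste0(Event, collapse = "+")) %>%
  arrange(match(DateTime, unique(my_df$DateTime)))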
Upvotes: 5