Reputation: 883
I have a dataset which contains four columns: X1 (ID number), X2 (Datetime), X3 (Datetime) and X4 (Duration). It looks as follows:
test2 = structure(list(X1 = c(549395L, 678018L, 706197L, 549395L, 775731L, 789858L, 845277L, 936749L, 845277L, 954953L), X2 = c("6/16/2014", "9/16/2014", "2/12/2014", "6/16/2014", "8/29/2014", "2/26/2014", "4/7/2014", "2/14/2014", "5/18/2014", "3/5/2014"), X3 = c("6/4/2014 11:10", "9/16/2014 10:23", "2/12/2014 12:21", "6/4/2014 11:10", "8/29/2014 8:51", "2/26/2014 12:49", "4/7/2014 13:59", "2/14/2014 12:08", "4/7/2014 01:39", "3/5/2014 10:14"), X4 = c(8L, 21L, 10L, 72L, 39L, 14L, 41L, 31L, 43L, 24L)), .Names = c("X1", "X2", "X3", "X4"), class = "data.frame", row.names = c(NA, -10L))
X1 X2 X3 X4
1 549395 6/16/2014 6/4/2014 11:10 8
2 678018 9/16/2014 9/16/2014 10:23 21
3 706197 2/12/2014 2/12/2014 12:21 10
4 549395 6/16/2014 6/4/2014 11:10 72
5 775731 8/29/2014 8/29/2014 8:51 39
6 789858 2/26/2014 2/26/2014 12:49 14
7 845277 4/7/2014 4/7/2014 13:59 41
8 936749 2/14/2014 2/14/2014 12:08 31
9 845277 5/18/2014 4/7/2014 01:39 43
10 954953 3/5/2014 3/5/2014 10:14 24
What I want to accomplish is this:
If rows with the same X1 (ID number) have the same date and time in X3, then replace their corresponding X4 with the max of those X4 values. For example, ID number 549395 appears twice with the same date and time in X3 (6/4/2014 11:10), so X4 should be replaced by max(72, 8) = 72.
If rows with the same X1 (ID number) have the same date but different times in X3, then replace their corresponding X4 with the sum of their individual X4 values. For example, ID number 845277 has the same date (4/7/2014) but different times (13:59, 01:39), so X4 should be sum(43, 41) = 84.
The output should look like this.
X1 X2 X3 X4
1 549395 6/16/2014 6/4/2014 11:10 72
2 678018 9/16/2014 9/16/2014 10:23 21
3 706197 2/12/2014 2/12/2014 12:21 10
4 549395 6/16/2014 6/4/2014 11:10 72
5 775731 8/29/2014 8/29/2014 8:51 39
6 789858 2/26/2014 2/26/2014 12:49 14
7 845277 4/7/2014 4/7/2014 13:59 84
8 936749 2/14/2014 2/14/2014 12:08 31
9 845277 5/18/2014 4/7/2014 01:39 84
10 954953 3/5/2014 3/5/2014 10:14 24
I tried using the aggregate function, but the results are nowhere near what I want to accomplish, so I need help.
test2= structure(list(X1 = c(1491930L, 3162932L, 3162932L, 4092879L,
5374073L, 7427514L, 2377939L, 2377939L, 4081399L), X2 = structure(c(16073,
16073, 16073, 16073, 16074, 16073, 16081, 16081, 16077), class = "Date"),
X3 = structure(c(4L, 5L, 1L, 3L, 6L, 2L, 9L, 8L, 7L), .Label = c("2014-01-03 10:11",
"2014-01-03 11:1", "2014-01-03 11:44", "2014-01-03 8:47",
"2014-01-03 9:40", "2014-01-04 11:7", "2014-01-07 8:42",
"2014-01-11 2:58", "2014-01-11 2:6"), class = "factor"),
X4 = c(31, 26, 4, 32, 24, 132, 50, 16, 66)), .Names = c("X1",
"X2", "X3", "X4"), row.names = c(NA, -9L), class = "data.frame")
Desired result:
X1 X2 date time X4
1 1491930 2014-01-03 2014-01-03 8:47 31
2 3162932 2014-01-03 2014-01-03 9:40 30
3 3162932 2014-01-03 2014-01-03 10:11 30
4 4092879 2014-01-03 2014-01-03 11:44 32
5 5374073 2014-01-04 2014-01-04 11:7 24
6 7427514 2014-01-03 2014-01-03 11:1 132
7 2377939 2014-01-11 2014-01-11 2:6 66
8 2377939 2014-01-11 2014-01-11 2:58 66
9 4081399 2014-01-07 2014-01-07 8:42 66
Upvotes: 2
Views: 1884
Reputation: 118779
Here's a slightly simpler logic, implemented using data.table:
require(data.table)
setDT(test2)[, tmp := NA]
test2[, c("X4", "tmp") := if (.N > 1) list(max(X4), TRUE), by=.(X1, X3)] ## (1)
test2[is.na(tmp), X4 := sum(X4), by=.(X1, gsub(" .*$", "", X3))] ## (2)
# X1 X2 X3 X4 tmp
# 1: 1491930 2014-01-03 2014-01-03 8:47 31 NA
# 2: 3162932 2014-01-03 2014-01-03 9:40 30 NA
# 3: 3162932 2014-01-03 2014-01-03 10:11 30 NA
# 4: 4092879 2014-01-03 2014-01-03 11:44 32 NA
# 5: 5374073 2014-01-04 2014-01-04 11:7 24 NA
# 6: 7427514 2014-01-03 2014-01-03 11:1 132 NA
# 7: 2377939 2014-01-11 2014-01-11 2:6 66 NA
# 8: 2377939 2014-01-11 2014-01-11 2:58 66 NA
# 9: 4081399 2014-01-07 2014-01-07 8:42 66 NA
First, setDT() converts test2 from a data.frame to a data.table by reference (without any additional copy), so test2 is now a data.table on which we can use data.table syntax. We then add a new dummy column, tmp, filled with NA, to act as a marker.
Aggregate by the X1 and X3 columns. If a group has more than one row, those rows fall under your first condition, and only those rows should have X4 replaced by the group maximum. We do that in (1) and, in addition, set the dummy column to TRUE for those rows.
Then we group by X1 and just the date part of X3, but only on the rows that were not modified in the previous step. These satisfy your second condition, and we replace X4 with the sum of all values in that group.
This should also take care of the case where you have rows satisfying both conditions for a given X1 and date.
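To make that last point concrete, here is a minimal sketch that runs the same two steps on a small made-up data set. The name mixed and its five rows are assumptions for illustration, modelled on the question's data: one ID has two rows with an identical timestamp and two rows with different times on the same date.

```r
library(data.table)

# Hypothetical data: ID 3162932 has a repeated timestamp (rows 1-2)
# and two distinct times on the same date (rows 3-4).
mixed <- data.table(
  X1 = c(3162932L, 3162932L, 3162932L, 3162932L, 5374073L),
  X3 = c("2014-01-03 8:47", "2014-01-03 8:47", "2014-01-03 10:11",
         "2014-01-03 11:44", "2014-01-04 11:7"),
  X4 = c(31, 26, 4, 32, 24)
)

mixed[, tmp := NA]  # marker column, NA = not yet handled

# Step (1): identical id + timestamp -> group maximum, flag rows as handled
mixed[, c("X4", "tmp") := if (.N > 1) list(max(X4), TRUE), by = .(X1, X3)]

# Step (2): unflagged rows -> sum per id and date part of X3
mixed[is.na(tmp), X4 := sum(X4), by = .(X1, day = sub(" .*$", "", X3))]

mixed[, tmp := NULL]
mixed$X4
# 31 31 36 36 24
```

Rows 1 and 2 receive the maximum (31), rows 3 and 4 receive the sum of the unflagged rows only (4 + 32 = 36), and the single-row group keeps its own value.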
If necessary, you can remove the tmp column by setting it to NULL:
test2[, tmp := NULL]
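If data.table is not available, the same two-step logic can be sketched in base R with ave(). This is an alternative rather than the method shown above. It is applied here to the first data set from the question, with X2 left out because it plays no part in the logic. The helper names key_time, key_day and dup are introduced only for this sketch.

```r
# First data set from the question (X2 omitted; it does not enter the logic)
test2 <- data.frame(
  X1 = c(549395L, 678018L, 706197L, 549395L, 775731L,
         789858L, 845277L, 936749L, 845277L, 954953L),
  X3 = c("6/4/2014 11:10", "9/16/2014 10:23", "2/12/2014 12:21",
         "6/4/2014 11:10", "8/29/2014 8:51", "2/26/2014 12:49",
         "4/7/2014 13:59", "2/14/2014 12:08", "4/7/2014 01:39",
         "3/5/2014 10:14"),
  X4 = c(8L, 21L, 10L, 72L, 39L, 14L, 41L, 31L, 43L, 24L),
  stringsAsFactors = FALSE
)

# One pasted key per row, so every group ave() sees is non-empty
key_time <- paste(test2$X1, test2$X3)                   # id + full timestamp
key_day  <- paste(test2$X1, sub(" .*$", "", test2$X3))  # id + date part only

# Step 1: rows sharing id and timestamp get the group maximum
dup <- ave(test2$X4, key_time, FUN = length) > 1
test2$X4[dup] <- ave(test2$X4, key_time, FUN = max)[dup]

# Step 2: the remaining rows get the sum per id and date
test2$X4[!dup] <- ave(test2$X4[!dup], key_day[!dup], FUN = sum)

test2$X4
# 72 21 10 72 39 14 84 31 84 24
```

The logical vector dup plays the role of the tmp marker column: rows handled in step 1 are excluded from the sums in step 2.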
Upvotes: 1
Reputation: 70256
Here's one way you could do it using dplyr and tidyr. I included some date formatting which increases the length of the "pipe", but it would be worth it IMO. I'll mark those parts of the pipe as "optional".
library(dplyr); library(tidyr)
test2 %>%
separate(X3, into = c("date", "time"), sep = " ") %>%
mutate_each(funs(as.Date(., format = "%Y-%m-%d")), X2, date) %>%
group_by(X1, date, time) %>%
mutate(X4 = max(X4)) %>%
group_by(X1, date) %>%
mutate(X4 = ifelse(n_distinct(time) == 1L, X4, sum(X4))) %>%
ungroup() %>%
unite(X3, date:time, sep = " ") %>%
mutate(X3 = as.POSIXct(X3))
Using the new data set, this produces:
Source: local data frame [9 x 4]
X1 X2 X3 X4
1 1491930 2014-01-03 2014-01-03 08:47:00 31
2 3162932 2014-01-03 2014-01-03 09:40:00 30
3 3162932 2014-01-03 2014-01-03 10:11:00 30
4 4092879 2014-01-03 2014-01-03 11:44:00 32
5 5374073 2014-01-04 2014-01-04 11:07:00 24
6 7427514 2014-01-03 2014-01-03 11:01:00 132
7 2377939 2014-01-11 2014-01-11 02:06:00 66
8 2377939 2014-01-11 2014-01-11 02:58:00 66
9 4081399 2014-01-07 2014-01-07 08:42:00 66
I start off by separating the original X3 column into two columns, date and time (using separate from tidyr), and formatting columns X2 and date as a real Date format. Next, I group the data by X1, date and time and replace X4 with the maximum X4 within each date/time group. Then I group the data only by X1 and date and check whether the number of unique time values is 1. If so, X4 is returned unchanged; otherwise, the sum of X4 is returned. What follows is ungrouping the data, uniting the columns date and time back into a single column X3, and formatting X3 as a datetime using as.POSIXct.
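In current dplyr releases, mutate_each() and funs() are deprecated, so the same steps can be sketched with across() instead. This is a minimal version under a few assumptions: dplyr 1.0 or later is installed, the second data set is rebuilt here with X2 and X3 as plain character strings, the final as.POSIXct conversion is left out, and res is a name introduced for the result.

```r
library(dplyr)
library(tidyr)

# Second data set from the question, with X2 and X3 as character strings
test2 <- data.frame(
  X1 = c(1491930L, 3162932L, 3162932L, 4092879L, 5374073L,
         7427514L, 2377939L, 2377939L, 4081399L),
  X2 = c("2014-01-03", "2014-01-03", "2014-01-03", "2014-01-03", "2014-01-04",
         "2014-01-03", "2014-01-11", "2014-01-11", "2014-01-07"),
  X3 = c("2014-01-03 8:47", "2014-01-03 9:40", "2014-01-03 10:11",
         "2014-01-03 11:44", "2014-01-04 11:7", "2014-01-03 11:1",
         "2014-01-11 2:6", "2014-01-11 2:58", "2014-01-07 8:42"),
  X4 = c(31, 26, 4, 32, 24, 132, 50, 16, 66),
  stringsAsFactors = FALSE
)

res <- test2 %>%
  separate(X3, into = c("date", "time"), sep = " ") %>%
  # optional: store X2 and date as real Date columns
  mutate(across(c(X2, date), ~ as.Date(.x, format = "%Y-%m-%d"))) %>%
  # same id, date and time: take the maximum
  group_by(X1, date, time) %>%
  mutate(X4 = max(X4)) %>%
  # same id and date, more than one distinct time: take the sum
  group_by(X1, date) %>%
  mutate(X4 = if (n_distinct(time) == 1L) X4 else sum(X4)) %>%
  ungroup() %>%
  unite(X3, date, time, sep = " ")

res$X4
# 31 30 30 32 24 132 66 66 66
```

Using a plain if/else inside mutate() works here because the condition is a single value per group, so either the whole X4 vector or one summed value is returned for the group.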
There's a hypothetical case in which it's not 100% clear what behavior you expect, and it might differ from what the code above does. This would apply if there could be an id (X1) and date combination that has, for example, 5 observations / rows, where in three of them the time is the same and in two the times are different. The code above would result in the same X4 for all observations of that id-date combination. If you don't want that, you could use the modified code below. I'll also provide a modified example to demonstrate the behavior.
First, let's create a new data set test3 and modify it so that for X1 == 3162932 and date == 2014-01-03 there are 4 time entries: 2 are the same (so for those, X4 should be replaced by its max) and 2 are different (so for those, we should replace X4 with the sum of their X4s).
test3 <- test2
test3$X1[1:4] <- 3162932L  # integer literal keeps X1 an integer column
test3$X3[2] <- "2014-01-03 8:47"
> test3
X1 X2 X3 X4 # all 4 first rows have the same id
1 3162932 2014-01-03 2014-01-03 8:47 31 # the first two are
2 3162932 2014-01-03 2014-01-03 8:47 26 # same id, same date, same time
3 3162932 2014-01-03 2014-01-03 10:11 4 # this and the next are
4 3162932 2014-01-03 2014-01-03 11:44 32 # same id, same date, different times
5 5374073 2014-01-04 2014-01-04 11:7 24
6 7427514 2014-01-03 2014-01-03 11:1 132
7 2377939 2014-01-11 2014-01-11 2:6 50
8 2377939 2014-01-11 2014-01-11 2:58 16
9 4081399 2014-01-07 2014-01-07 8:42 66
Here's how you could handle it as I think you'd like (but again, you haven't really specified that):
test3 %>%
separate(X3, into = c("date", "time"), sep = " ") %>%
mutate_each(funs(as.Date(., format = "%Y-%m-%d")), X2, date) %>%
group_by(X1, date, time) %>%
mutate(X4 = max(X4), check = n() == 1L) %>%
group_by(X1, date) %>%
mutate(X4 = ifelse(n_distinct(time) > 1L & check, sum(X4[check]), X4)) %>%
ungroup() %>%
unite(X3, date:time, sep = " ") %>%
mutate(X3 = as.POSIXct(X3)) %>%
select(ID = X1, Date = X2, DateTime = X3, Value = X4)
Source: local data frame [9 x 4]
ID Date DateTime Value
1 3162932 2014-01-03 2014-01-03 08:47:00 31 # replaced by max
2 3162932 2014-01-03 2014-01-03 08:47:00 31 # replaced by max
3 3162932 2014-01-03 2014-01-03 10:11:00 36 # replaced by sum of 4 + 32
4 3162932 2014-01-03 2014-01-03 11:44:00 36 # replaced by sum of 4 + 32
5 5374073 2014-01-04 2014-01-04 11:07:00 24
6 7427514 2014-01-03 2014-01-03 11:01:00 132
7 2377939 2014-01-11 2014-01-11 02:06:00 66
8 2377939 2014-01-11 2014-01-11 02:58:00 66
9 4081399 2014-01-07 2014-01-07 08:42:00 66
Upvotes: 1