Science11

Reputation: 883

Aggregate based on conditions in R

I have a dataset with four columns: X1 (ID number), X2 (Datetime), X3 (Datetime) and X4 (Duration). It looks like this:

test2 = structure(list(X1 = c(549395L, 678018L, 706197L, 549395L, 775731L, 789858L, 845277L, 936749L, 845277L, 954953L), X2 = c("6/16/2014", "9/16/2014", "2/12/2014", "6/16/2014", "8/29/2014", "2/26/2014", "4/7/2014", "2/14/2014", "5/18/2014", "3/5/2014"), X3 = c("6/4/2014 11:10", "9/16/2014 10:23", "2/12/2014 12:21", "6/4/2014 11:10", "8/29/2014 8:51", "2/26/2014 12:49", "4/7/2014 13:59", "2/14/2014 12:08", "4/7/2014 01:39", "3/5/2014 10:14"), X4 = c(8L, 21L, 10L, 72L, 39L, 14L, 41L, 31L, 43L, 24L)), .Names = c("X1", "X2", "X3", "X4"), class = "data.frame", row.names = c(NA, -10L))

       X1        X2              X3 X4
1  549395 6/16/2014  6/4/2014 11:10  8
2  678018 9/16/2014 9/16/2014 10:23 21
3  706197 2/12/2014 2/12/2014 12:21 10
4  549395 6/16/2014  6/4/2014 11:10 72
5  775731 8/29/2014  8/29/2014 8:51 39
6  789858 2/26/2014 2/26/2014 12:49 14
7  845277  4/7/2014  4/7/2014 13:59 41
8  936749 2/14/2014 2/14/2014 12:08 31
9  845277 5/18/2014  4/7/2014 01:39 43
10 954953  3/5/2014  3/5/2014 10:14 24

What I want to accomplish is this

  1. If rows with the same X1 (ID number) have the same date and time in X3, replace their X4 values with the maximum of those X4 values. For example, ID 549395 has the same date and time in X3 (6/4/2014 11:10) in both of its rows, so X4 should be replaced by max(72, 8) = 72.

  2. If rows with the same X1 (ID number) have the same date in X3 but different times, replace their X4 values with the sum of the individual X4 values. For example, ID 845277 has the same date (4/7/2014) but different times (13:59 and 01:39), so X4 should be sum(43, 41) = 84.

The output should look like this.

       X1        X2              X3  X4
1  549395 6/16/2014  6/4/2014 11:10  72
2  678018 9/16/2014 9/16/2014 10:23  21
3  706197 2/12/2014 2/12/2014 12:21  10
4  549395 6/16/2014  6/4/2014 11:10  72
5  775731 8/29/2014 8/29/2014  8:51  39
6  789858 2/26/2014 2/26/2014 12:49  14
7  845277  4/7/2014  4/7/2014 13:59  84
8  936749 2/14/2014 2/14/2014 12:08  31
9  845277 5/18/2014  4/7/2014 01:39  84
10 954953  3/5/2014  3/5/2014 10:14  24

I tried using the aggregate function, but the results are nowhere near what I want to accomplish, so I need help.

New data:

test2=   structure(list(X1 = c(1491930L, 3162932L, 3162932L, 4092879L, 
5374073L, 7427514L, 2377939L, 2377939L, 4081399L), X2 = structure(c(16073, 
16073, 16073, 16073, 16074, 16073, 16081, 16081, 16077), class = "Date"), 
X3 = structure(c(4L, 5L, 1L, 3L, 6L, 2L, 9L, 8L, 7L), .Label = c("2014-01-03 10:11", 
"2014-01-03 11:1", "2014-01-03 11:44", "2014-01-03 8:47", 
"2014-01-03 9:40", "2014-01-04 11:7", "2014-01-07 8:42", 
"2014-01-11 2:58", "2014-01-11 2:6"), class = "factor"), 
X4 = c(31, 26, 4, 32, 24, 132, 50, 16, 66)), .Names = c("X1", 
"X2", "X3", "X4"), row.names = c(NA, -9L), class = "data.frame")

Desired result:

          X1         X2       date       time  X4
        1 1491930 2014-01-03 2014-01-03  8:47  31
        2 3162932 2014-01-03 2014-01-03  9:40  30
        3 3162932 2014-01-03 2014-01-03 10:11  30
        4 4092879 2014-01-03 2014-01-03 11:44  32
        5 5374073 2014-01-04 2014-01-04  11:7  24
        6 7427514 2014-01-03 2014-01-03  11:1 132
        7 2377939 2014-01-11 2014-01-11   2:6  66
        8 2377939 2014-01-11 2014-01-11  2:58  66
        9 4081399 2014-01-07 2014-01-07  8:42  66

Upvotes: 2

Views: 1884

Answers (2)

Arun

Reputation: 118779

Here's a slightly simpler logic, implemented using data.table:

require(data.table)
setDT(test2)[, tmp := NA]

test2[, c("X4", "tmp") := if (.N > 1) list(max(X4), TRUE), by=.(X1, X3)] ## (1)
test2[is.na(tmp), X4 := sum(X4), by=.(X1, gsub(" .*$", "", X3))]         ## (2)
#         X1         X2               X3  X4 tmp
# 1: 1491930 2014-01-03  2014-01-03 8:47  31  NA
# 2: 3162932 2014-01-03  2014-01-03 9:40  30  NA
# 3: 3162932 2014-01-03 2014-01-03 10:11  30  NA
# 4: 4092879 2014-01-03 2014-01-03 11:44  32  NA
# 5: 5374073 2014-01-04  2014-01-04 11:7  24  NA
# 6: 7427514 2014-01-03  2014-01-03 11:1 132  NA
# 7: 2377939 2014-01-11   2014-01-11 2:6  66  NA
# 8: 2377939 2014-01-11  2014-01-11 2:58  66  NA
# 9: 4081399 2014-01-07  2014-01-07 8:42  66  NA

  1. Convert test2 from a data.frame to a data.table by reference (without making an additional copy), so that data.table syntax can be used on it. Then add a new (dummy) column, tmp, filled with NA, to serve as a marker.

  2. Group by the X1 and X3 columns. Groups with more than one row fall under your first condition, and only those rows have to be replaced by the maximum value of X4. That happens in (1), which also sets the dummy column to TRUE for those rows.

  3. Then group by X1 and just the date part of X3, but only on the rows that were not modified in the previous step (tmp is still NA). These satisfy your second condition, and X4 is replaced with the sum of all values in that group.

This should also take care of the case where you have rows satisfying both conditions for a given X1 and date.

If necessary you can remove the tmp column by setting it to NULL:

test2[, tmp := NULL]
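
For comparison, the same two-step idea can be sketched in base R with ave(). This is only an illustration under stated assumptions, not a replacement for the data.table code: the small data frame df and the helper names key_time, key_day, dup_time and rest are made up for the example and do not come from the question.

```r
# Hypothetical data: id 1 has a duplicated timestamp,
# id 2 has two different times on the same date.
df <- data.frame(
  X1 = c(1L, 1L, 2L, 2L, 3L),
  X3 = c("2014-01-03 8:47", "2014-01-03 8:47",
         "2014-01-11 2:6", "2014-01-11 2:58", "2014-01-07 8:42"),
  X4 = c(8, 72, 50, 16, 66),
  stringsAsFactors = FALSE
)

# Grouping keys: id + full timestamp, and id + date part only.
key_time <- paste(df$X1, df$X3)
key_day  <- paste(df$X1, sub(" .*$", "", df$X3))

# Step (1): rows sharing id and timestamp get the group maximum.
dup_time <- ave(df$X4, key_time, FUN = length) > 1
df$X4[dup_time] <- ave(df$X4, key_time, FUN = max)[dup_time]

# Step (2): the remaining rows get the sum of X4 within id and date.
rest <- !dup_time
df$X4[rest] <- ave(df$X4[rest], key_day[rest], FUN = sum)

print(df)
```

Pasting the grouping columns into a single key keeps ave() from evaluating the function on empty combinations of id and timestamp.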

Upvotes: 1

talat

Reputation: 70256

Here's one way to do it using dplyr and tidyr. I included some date formatting, which increases the length of the "pipe", but it is worth it IMO. I'll mark those parts of the pipe as "optional".

library(dplyr); library(tidyr)

test2 %>%
  separate(X3, into = c("date", "time"), sep = " ") %>%
  mutate_each(funs(as.Date(., format = "%Y-%m-%d")), X2, date) %>%
  group_by(X1, date, time) %>%
  mutate(X4 = max(X4)) %>%
  group_by(X1, date) %>%
  mutate(X4 = ifelse(n_distinct(time) == 1L, X4, sum(X4))) %>%
  ungroup() %>%
  unite(X3, date:time, sep = " ") %>%
  mutate(X3 = as.POSIXct(X3))

Using the new data set, this produces:

Source: local data frame [9 x 4]

       X1         X2                  X3  X4
1 1491930 2014-01-03 2014-01-03 08:47:00  31
2 3162932 2014-01-03 2014-01-03 09:40:00  30
3 3162932 2014-01-03 2014-01-03 10:11:00  30
4 4092879 2014-01-03 2014-01-03 11:44:00  32
5 5374073 2014-01-04 2014-01-04 11:07:00  24
6 7427514 2014-01-03 2014-01-03 11:01:00 132
7 2377939 2014-01-11 2014-01-11 02:06:00  66
8 2377939 2014-01-11 2014-01-11 02:58:00  66
9 4081399 2014-01-07 2014-01-07 08:42:00  66

I start off by separating the original X3 column into two columns, date and time (using separate from tidyr), and converting columns X2 and date to a real Date format. Next, I group the data by X1, date and time and replace X4 with the maximum X4 within each group. Then I group the data by X1 and date only and check whether the number of unique time values is 1. If so, X4 is returned unchanged; otherwise, the sum of X4 is returned. Finally, I ungroup the data, unite the columns date and time back into a single column X3, and convert X3 to a datetime using as.POSIXct.
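
The date handling on its own can be sketched in base R as follows. It is a minimal illustration that assumes the two timestamp layouts seen in the question (year-month-day in the new data, month/day/year in the original data) and uses UTC as an arbitrary time zone; the variable names are made up for the example.

```r
# Timestamps in the two layouts that appear in the question.
iso <- c("2014-01-03 8:47", "2014-01-11 2:6")   # year-month-day, unpadded time
us  <- c("6/4/2014 11:10", "4/7/2014 01:39")    # month/day/year

# An explicit format and time zone avoid relying on local settings.
iso_dt <- as.POSIXct(iso, format = "%Y-%m-%d %H:%M", tz = "UTC")
us_dt  <- as.POSIXct(us,  format = "%m/%d/%Y %H:%M", tz = "UTC")

# The date part alone, usable as a grouping key.
iso_day <- as.Date(iso_dt)
us_day  <- as.Date(us_dt)

print(format(iso_dt, "%Y-%m-%d %H:%M:%S"))
print(us_day)
```

Working with real date and datetime classes rather than strings means that "8:47" and "08:47" are treated as the same time when grouping.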


There's a hypothetical case where it's not 100% clear what behavior you expect, and it might differ from what the code above does. It applies if an id (X1) and date combination has, for example, 5 observations (rows), where three of them share the same time and the other two have different times. The code above would set X4 to the same value for all observations of that id-date combination. If you don't want that, you could use the modified code below. I'll also provide a modified example to demonstrate the behavior.

First, let's create a new data set test3 and modify it so that for X1 == 3162932 and date == 2014-01-03 there are 4 time entries: 2 are the same (so for those, X4 should be replaced by its max) and 2 are different (so for those, X4 should be replaced by the sum of their X4 values).

test3 <- test2
test3$X1[1:4] <- 3162932L  # integer literal keeps X1 an integer column (a quoted value would coerce it to character)
test3$X3[2] <- "2014-01-03 8:47"

> test3
       X1         X2               X3  X4  # all 4 first rows have the same id
1 3162932 2014-01-03  2014-01-03 8:47  31   #   the first two are 
2 3162932 2014-01-03  2014-01-03 8:47  26   #     same id, same date, same time
3 3162932 2014-01-03 2014-01-03 10:11   4   #   this and the next are
4 3162932 2014-01-03 2014-01-03 11:44  32   #     same id, same date, different times
5 5374073 2014-01-04  2014-01-04 11:7  24
6 7427514 2014-01-03  2014-01-03 11:1 132
7 2377939 2014-01-11   2014-01-11 2:6  50
8 2377939 2014-01-11  2014-01-11 2:58  16
9 4081399 2014-01-07  2014-01-07 8:42  66

Here's how you could handle it as I think you'd like (but again, you haven't really specified that):

test3 %>%
  separate(X3, into = c("date", "time"), sep = " ") %>%
  mutate_each(funs(as.Date(., format = "%Y-%m-%d")), X2, date) %>%
  group_by(X1, date, time) %>%
  mutate(X4 = max(X4), check = n() == 1L) %>%
  group_by(X1, date) %>%
  mutate(X4 = ifelse(n_distinct(time) > 1L & check, sum(X4[check]), X4)) %>%
  ungroup() %>%
  unite(X3, date:time, sep = " ") %>%
  mutate(X3 = as.POSIXct(X3)) %>%
  select(ID = X1, Date = X2, DateTime = X3, Value = X4)

Source: local data frame [9 x 4]

       ID       Date            DateTime Value
1 3162932 2014-01-03 2014-01-03 08:47:00    31    # replaced by max
2 3162932 2014-01-03 2014-01-03 08:47:00    31    # replaced by max
3 3162932 2014-01-03 2014-01-03 10:11:00    36    # replaced by sum of 4 + 32
4 3162932 2014-01-03 2014-01-03 11:44:00    36    # replaced by sum of 4 + 32
5 5374073 2014-01-04 2014-01-04 11:07:00    24
6 7427514 2014-01-03 2014-01-03 11:01:00   132
7 2377939 2014-01-11 2014-01-11 02:06:00    66
8 2377939 2014-01-11 2014-01-11 02:58:00    66
9 4081399 2014-01-07 2014-01-07 08:42:00    66
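
The modified logic can also be checked without dplyr. The sketch below is a base R illustration, not the code from this answer: it rebuilds the first five rows of test3 by hand with date and time already split, and the names single, n_times, x4_max and x4_sum are assumptions chosen to mirror the check column and the two aggregates.

```r
# First five rows of test3, with date and time already separated.
df <- data.frame(
  X1   = c(3162932L, 3162932L, 3162932L, 3162932L, 5374073L),
  date = c("2014-01-03", "2014-01-03", "2014-01-03", "2014-01-03", "2014-01-04"),
  time = c("8:47", "8:47", "10:11", "11:44", "11:7"),
  X4   = c(31, 26, 4, 32, 24),
  stringsAsFactors = FALSE
)

id_date      <- paste(df$X1, df$date)
id_date_time <- paste(id_date, df$time)

# Maximum within id/date/time, and a flag for times that occur only once.
x4_max <- ave(df$X4, id_date_time, FUN = max)
single <- ave(df$X4, id_date_time, FUN = length) == 1

# Sum over the single-time rows only, per id/date.
x4_sum <- ave(ifelse(single, df$X4, 0), id_date, FUN = sum)

# Number of distinct times per id/date.
n_times <- ave(seq_along(df$time), id_date,
               FUN = function(i) length(unique(df$time[i])))

# Single-time rows in a multi-time group get the sum, all others the max.
df$X4 <- ifelse(single & n_times > 1, x4_sum, x4_max)
print(df$X4)
# 31 31 36 36 24
```

The first two rows keep the maximum (31), rows three and four receive 4 + 32 = 36, and the lone row for id 5374073 is unchanged, matching the output above.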

Upvotes: 1
