Reputation: 43
I have a data frame like this:
df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))
|ID | time| intensity|
|:--|----:|---------:|
|A  |  3.1|        10|
|A  |  3.2|        20|
|B  |  6.5|        30|
|B  | 12.3|        40|
|C  |  3.2|        50|
|C  |  3.4|        60|
I would like to aggregate values (sum the intensities) by ID, but only when the time difference within the ID is less than some threshold, e.g. 0.3. First I calculated this time difference:
library(dplyr)
df.2 <- df %>%
  group_by(ID) %>%
  mutate(time.diff = max(time) - min(time))
...resulting in:
|ID | time| intensity| time.diff|
|:--|----:|---------:|---------:|
|A  |  3.1|        10|       0.1|
|A  |  3.2|        20|       0.1|
|B  |  6.5|        30|       5.8|
|B  | 12.3|        40|       5.8|
|C  |  3.2|        50|       0.2|
|C  |  3.4|        60|       0.2|
Just to be clear, what I would like to get as an output would be:
|ID | time| intensity| time.diff|
|:--|----:|---------:|---------:|
|A  | 3.15|        30|       0.1|
|B  |  6.5|        30|       5.8|
|B  | 12.3|        40|       5.8|
|C  |  3.3|       110|       0.2|
where time is now the mean of the merged observations and intensity is their sum. ID "B" keeps both observations because its time difference is greater than 0.3. I have tried dplyr, but summarise always collapses "B" to a single row, and I want to keep both rows. I don't know how to do a conditional _group_by_.
Thank you for any ideas!
Upvotes: 4
Views: 3968
Reputation: 13580
Using sqldf:
library(sqldf)
sqldf('SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif FROM df
GROUP BY ID
HAVING (MAX(time)-MIN(time))<0.3
UNION
SELECT ID, df.time, df.intensity, df2.dif
FROM (SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif
FROM df
GROUP BY ID
HAVING (MAX(time)-MIN(time))>=0.3) as df2
LEFT JOIN df USING (ID)')
Output:
ID time intensity dif
1 A 3.15 30 0.1
2 B 6.50 30 5.8
3 B 12.30 40 5.8
4 C 3.30 110 0.2
Upvotes: 1
Reputation: 2907
# per-ID time spread (max - min), repeated on every row of the ID
df$time.diff <- ave(x = df$time, df$ID, FUN = function(x) max(x) - min(x))
# helper split variable: increments on every row whose ID spread exceeds 0.3
df$cut <- cumsum(df$time.diff > .3)
# aggregate by (cut, ID), then keep columns 2:5 to drop the helper column
library(plyr)
ddply(df, c('cut', 'ID'), summarize,
      time = mean(time),
      intensity = sum(intensity),
      time.diff = mean(time.diff))[2:5]
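To see why the `cumsum` trick above keeps the "B" rows apart, here is a dependency-free sketch in base R that prints the helper column and then aggregates over the same `(cut, ID)` key. The names `pieces` and `res` are only illustrative, and the sketch assumes the rows are already sorted by ID (a tight group interrupted by rows of a wide group would otherwise be split).

```r
df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))

# Per-ID spread of time, repeated on every row of the group
df$time.diff <- ave(df$time, df$ID, FUN = function(x) max(x) - min(x))

# Running count of rows from "wide" groups: it increments on each such row
# and stays flat across the rows of a "tight" group
df$cut <- cumsum(df$time.diff > 0.3)
print(df[, c("ID", "time.diff", "cut")])

# Split on the (cut, ID) key; drop = TRUE discards empty combinations
pieces <- split(df, list(df$cut, df$ID), drop = TRUE)

# Summarise each piece: mean time, summed intensity, spread carried along
res <- do.call(rbind, lapply(pieces, function(g) {
  data.frame(ID = g$ID[1],
             time = mean(g$time),
             intensity = sum(g$intensity),
             time.diff = g$time.diff[1])
}))
rownames(res) <- NULL
print(res)
```

Each row of a wide group gets its own `cut` value, so it forms a one-row piece and passes through the summary unchanged, while the rows of a tight group share a `cut` value and are merged.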
Upvotes: 1
Reputation: 887881
A possible option with data.table
library(data.table)
unique(setDT(df)[, time.diff := max(time)-min(time), ID][
time.diff <= 0.3, c('time', 'intensity') := list(mean(time),
sum(intensity)), ID])
# ID time intensity time.diff
#1: A 3.15 30 0.1
#2: B 6.50 30 5.8
#3: B 12.30 40 5.8
#4: C 3.30 110 0.2
Or using dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(time.diff = max(time) - min(time), indx = all(time.diff <= 0.3),
         intensity = ifelse(indx, sum(intensity), intensity),
         time = ifelse(indx, mean(time), time)) %>%
  filter(!indx | row_number() == 1) %>%
  select(-indx)
# ID time intensity time.diff
#1 A 3.15 30 0.1
#2 B 6.50 30 5.8
#3 B 12.30 40 5.8
#4 C 3.30 110 0.2
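The question asks for something like a conditional `group_by`. One way to get close to that, as a sketch rather than the approach of the answer above, is to build a helper key that is constant for tight groups and unique per row for wide groups, and then group on it. The column name `piece` is hypothetical, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))

out <- df %>%
  group_by(ID) %>%
  mutate(time.diff = max(time) - min(time),
         # tight group: one shared key; wide group: one key per row
         piece = if (time.diff[1] <= 0.3) 0L else row_number()) %>%
  # time.diff is constant within ID, so grouping on it only carries it along
  group_by(ID, piece, time.diff) %>%
  summarise(time = mean(time), intensity = sum(intensity), .groups = "drop") %>%
  select(ID, time, intensity, time.diff)

print(out)
```

With the example data this should give the same four rows as the output above. Unlike a running-count key, `piece` does not depend on the rows being sorted by ID.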
Upvotes: 3
Reputation: 118889
Another variation of the data.table solution:
setDT(df)[, time.diff := max(time) - min(time), by = ID
][, if (time.diff <= 0.3)
.(time = mean(time), intensity = sum(intensity))
else .SD, by = .(ID, time.diff)]
# ID time.diff time intensity
# 1: A 0.1 3.15 30
# 2: B 5.8 6.50 30
# 3: B 5.8 12.30 40
# 4: C 0.2 3.30 110
Upvotes: 3