Reputation: 131
I have a data frame that looks like this in R:
Date | Time | value
A | 1 | 3
A | 1 | 6
A | 2 | 4
A | 3 | 3
A | 4 | 2
A | 5 | 7
B | 1 | 6
B | 2 | 5
B | 2 | 3
B | 2 | 4
B | 3 | 2
B | 5 | 3
B | 6 | 4
...
My aim is to take the median of the numbers in the third column whenever the numbers in the second column are equal in consecutive rows. I.e., take the median of the values that occur at the same time, and replace those rows with a single row holding that median for the corresponding time slot.
So the output I am aiming for is:
A | 1 | median (3,6)
A | 2 | 4
A | 3 | 3
A | 4 | 2
A | 5 | 7
B | 1 | 6
B | 2 | median (3,4,5)
B | 3 | 2
B | 5 | 3
B | 6 | 4
...
I am desperately trying to avoid loops since the data set is huge. The main problem I am having is collecting the values for each group of equal time slots separately. This is what I have so far:
#First find consecutive time slots that are equal:
timestamp_equal_coordinates <- which(diff(data_RAW$TIME) == 0)
coordinates_placesholder <- sort(c(as.vector(timestamp_equal_coordinates), as.vector(timestamp_equal_coordinates)+1))
coordinates_placesholder2 <- coordinates_placesholder[-c(which(diff(coordinates_placesholder)==0), which(diff(coordinates_placesholder)==0) +1)]
#The following matrix holds the start and end row indices of each run of equal time slots in the value vector
matrix_ranges<-t(matrix(coordinates_placesholder2,2))
For the example above, matrix_ranges would look like:
1 | 2
8 | 10
Then I tried to apply something like
median(data_RAW$Value[matrix_ranges[,1]:matrix_ranges[,2]])
This did not work. Does anyone know how to do this?
Also, is there an easier way of doing this than what I did above?
Upvotes: 3
Views: 1957
Reputation: 193507
Two interpretations come to mind.
Interpretation 1: It's the combination of "Date" + "Time" that matters, not the consecutive repetition. In this case, just use aggregate (or your favorite aggregating function or package, like "data.table").
aggregate(value ~ Date + Time, mydf, median)
#    Date Time value
# 1     A    1   4.5
# 2     B    1   6.0
# 3     A    2   4.0
# 4     B    2   4.0
# 5     A    3   3.0
# 6     B    3   2.0
# 7     A    4   2.0
# 8     A    5   7.0
# 9     B    5   3.0
# 10    B    6   4.0
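The result above is ordered by Time first, while the output asked for in the question is ordered by Date and then Time. A minimal sketch of reordering it, assuming the same sample data (the name res is only illustrative):

```r
# Sample data from the question, rebuilt here so the snippet runs on its own
mydf <- data.frame(
  Date  = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B"),
  Time  = c(1L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 2L, 2L, 3L, 5L, 6L),
  value = c(3L, 6L, 4L, 3L, 2L, 7L, 6L, 5L, 3L, 4L, 2L, 3L, 4L),
  stringsAsFactors = FALSE
)

# One median per Date/Time combination
res <- aggregate(value ~ Date + Time, mydf, median)

# Reorder by Date, then Time, and reset the row names
res <- res[order(res$Date, res$Time), ]
rownames(res) <- NULL
res
```

With the sample data this gives the five "A" rows followed by the five "B" rows.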
Interpretation 2: The consecutive repetitions are important. In this case, you need another "grouping" variable. For this, we can use rle. After that, the aggregation step is pretty much the same.
RLE <- rle(DF$Time)$lengths
RLE <- rep(seq_along(RLE), RLE)
aggregate(value ~ Date + Time + RLE, DF, median)
#    Date Time RLE value
# 1     A    1   1   4.5
# 2     A    2   2   4.0
# 3     A    3   3   3.0
# 4     A    4   4   2.0
# 5     A    5   5   7.0
# 6     B    1   6   6.0
# 7     B    2   7   4.0
# 8     B    3   8   2.0
# 9     B    5   9   3.0
# 10    B    6  10   4.0
# 11    A    1  11   3.0
# 12    B    3  12   2.0
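In the call above, RLE is a free-standing vector that the formula picks up from the calling environment. A variation on the same idea, offered only as a sketch, stores the run id as a column and starts a new run whenever either Date or Time changes from one row to the next. The names new_run and run are illustrative, not part of the original answer.

```r
# Sample data from the question, plus a few repeated rows at the end
mydf <- data.frame(
  Date  = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B"),
  Time  = c(1L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 2L, 2L, 3L, 5L, 6L),
  value = c(3L, 6L, 4L, 3L, 2L, 7L, 6L, 5L, 3L, 4L, 2L, 3L, 4L),
  stringsAsFactors = FALSE
)
DF <- rbind(mydf, mydf[c(1, 1, 11, 11), ])

# TRUE wherever a row differs from the previous row in Date or Time
n <- nrow(DF)
new_run <- c(TRUE, DF$Date[-1] != DF$Date[-n] | DF$Time[-1] != DF$Time[-n])

# The cumulative sum of the change flags numbers the consecutive runs
DF$run <- cumsum(new_run)

# One median per run; Date and Time are kept as identifying columns
res <- aggregate(value ~ Date + Time + run, DF, median)
res
```

On this sample data the run ids coincide with the rle-based ones, so the medians are the same twelve values as above.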
For the benefit of others, here's some reproducible data: mydf and DF. (DF is just mydf with a few rows repeated.)
mydf <- structure(list(Date = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "B"), Time = c(1L, 1L, 2L, 3L, 4L, 5L, 1L,
2L, 2L, 2L, 3L, 5L, 6L), value = c(3L, 6L, 4L, 3L, 2L, 7L, 6L,
5L, 3L, 4L, 2L, 3L, 4L)), .Names = c("Date", "Time", "value"),
class = "data.frame", row.names = c(NA, -13L))
DF <- rbind(mydf, mydf[c(1, 1, 11, 11), ])
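As for why the attempt in the question did not work: the : operator only uses the first element of each of its arguments, so matrix_ranges[,1]:matrix_ranges[,2] expands to a single range (1:2 in the example) rather than one range per row. A minimal sketch of applying median to each start/end pair instead, assuming matrix_ranges holds the two rows shown in the question (the names values and medians are illustrative):

```r
# The value column from the question's example
values <- c(3L, 6L, 4L, 3L, 2L, 7L, 6L, 5L, 3L, 4L, 2L, 3L, 4L)

# Start and end row indices of each run of equal time slots, as in the question
matrix_ranges <- rbind(c(1, 2), c(8, 10))

# mapply walks over the start and end columns in parallel,
# so each range is expanded and summarised separately
medians <- mapply(
  function(from, to) median(values[from:to]),
  matrix_ranges[, 1],
  matrix_ranges[, 2]
)
medians
# 4.5 4.0
```

This still leaves the work of putting the medians back into the data frame, which is why a grouping-based approach with aggregate is likely simpler.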
Upvotes: 3