Reputation: 6179
This problem might be trivial but I am finding it difficult to solve it. Please guide me.
Following is sample data:
structure(list(Vehicle.ID2 = c("39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25"
), OC_DV = c(".", ".", ".", ".", ".", "CLDV", ".", ".", ".",
".", ".", ".", ".", ".", ".", "OPDV", ".", ".", ".", ".", ".",
".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".",
".", "CLDV", ".", ".", "."), frspacing = c(35.83373, 35.75742,
35.70391, 35.67694, 35.67792, 35.70669, 35.7619, 35.84096, 35.93962,
36.05109, 36.16704, 36.28056, 36.3861, 36.47762, 36.5485, 36.59359,
36.61402, 36.61791, 36.61383, 36.60651, 36.59694, 36.58372, 36.56525,
36.54044, 36.50771, 36.46458, 36.40831, 36.33713, 36.25086, 36.15089,
36.04004, 35.92236, 35.80322, 35.68935, 35.58883, 35.51032, 35.4618,
35.4492, 35.47479)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-39L), .Names = c("Vehicle.ID2", "OC_DV", "frspacing"))
I want to find the max and min of set of values in frspacing
between the labels CLDV
and OPDV
in the column OC_DV
. Then I want to find their difference.
Following are max and mins:
Group Max Min
1 CLDV-OPDV 36.54 35.70
2 OPDV-CLDV 36.62 35.59
Following are the absolute differences (max of 1st grp - min of 2nd group AND vice versa):
1 0.95
2 0.92
I don't have any code to show what I tried because honestly I don't know how to approach this problem. Obviously simple max
or min
by column won't work. I am using dplyr
and didn't find anything relevant.
Upvotes: 2
Views: 1941
Reputation: 7607
Below is a base R solution:
MaxMinSeq <- function(df) {
myInd <- which(df$OC_DV != ".")
myVals <- df$frspacing
myTitles <- df$OC_DV[myInd]
myLen <- length(myInd)-1L
NewDf <- as.data.frame(t(sapply(1:myLen, function(x) {
list(Group = paste(c(myTitles[x],"-",myTitles[x+1L]), collapse = ""),
Max = max(myVals[myInd[x]:(myInd[x+1L]-1L)]),
Min = min(myVals[myInd[x]:(myInd[x+1L]-1L)]))})))
for (i in 1:3) {NewDf[,i] <- unlist(NewDf[,i])}
NewDf
}
df2 <- MaxMinSeq(df)
df2
Group Max Min
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883
This is a good bit faster than the dplyr
solution posted above. Observe:
TestDplyr <- function(df) {
df[df=="."] <- NA
df$group <- paste((na.locf(df$OC_DV, na.rm = FALSE)), lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)), sep = "-")
df$group2 <- NA
df$group2[which(df$group != lag(df$group))] <- 1:length(which(df$group != lag(df$group)))
df$group2 <- na.locf(df$group2, na.rm = FALSE)
df %>% group_by(group, group2) %>%
summarise(Max = max(frspacing), Min = min(frspacing)) %>%
filter(!grepl("NA",group ))
}
microbenchmark(Joseph = MaxMinSeq(df), Cabana = TestDplyr(df))
Unit: microseconds
expr min lq mean median uq max neval
Joseph 338.671 377.6695 405.0257 405.9945 429.188 496.718 100
Cabana 2622.336 2698.2810 2890.5430 2765.6045 2977.427 7772.180 100
Here is a really big example:
myDfs <- lapply(1:10000, function(x) df)
bigDf <- do.call(rbind, myDfs)
bigDf$frspacing[40:nrow(bigDf)] <- runif((nrow(bigDf)-39), 10, 100)
a <- MaxMinSeq(bigDf)
b <- TestDplyr(bigDf)
b <- b[order(b$group2),]
identical(a$Max, b$Max)
[1] TRUE
identical(a$Min, b$Min)
[1] TRUE
system.time(TestDplyr(bigDf))
user system elapsed
1.54 0.00 1.54
system.time(MaxMinSeq(bigDf))
user system elapsed
0.3 0.0 0.3
As for the second part of the question, I'm not sure how general the OP would like the answer to be especially when there are more than two different final pairings. For example does the OP want to find the max of one row and compare that to the min of the min of all rows, or do we simply compare neighbors? The function below takes the first approach (i.e. the general approach).
GetDiff <- function(df) {
df2 <- cbind(df, t(sapply(1:nrow(df), function(x) {
c(rowMin = min(df[x,2:3]),
rowMax = max(df[x,2:3]))})))
myRows <- 1:nrow(df)
sapply(myRows, function(x) df2$rowMax[x] - min(df2$rowMin[-x]))
}
GetDiff(df2) ## df2 comes from above
[1] 0.95967 0.91122
Upvotes: 1
Reputation: 419
library(zoo) # for na.locf
library(dplyr)
df[df=="."] = NA
df$group = paste((na.locf(df$OC_DV, na.rm = FALSE)), lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)), sep = "-")
df %>% group_by(group) %>%
summarise(Max = max(frspacing), Min = min(frspacing)) %>%
filter(!grepl("NA",group ))
Source: local data frame [2 x 3]
group Max Min
(chr) (dbl) (dbl)
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883
With multiple values I would count the changes und use it as another grouping variable: (I duplicated the data in this example)
df$group2 = NA
df$group2[which(df$group != lag(df$group))] = 1:length(which(df$group != lag(df$group)))
df$group2 = na.locf(df$group2, na.rm = FALSE)
df %>% group_by(group, group2) %>%
summarise(Max = max(frspacing), Min = min(frspacing)) %>%
filter(!grepl("NA",group ))
Source: local data frame [5 x 4]
Groups: group [3]
group group2 Max Min
(chr) (int) (dbl) (dbl)
1 CLDV-CLDV 3 38.09082 34.30454
2 CLDV-OPDV 1 36.54850 35.70669
3 CLDV-OPDV 4 38.90356 34.08951
4 OPDV-CLDV 2 36.61791 35.58883
5 OPDV-CLDV 5 38.18983 34.27874
But if the combination of OC_DV
is distinct in every Vehicle.ID2
you can simply paste the ID in group...
Upvotes: 4
Reputation: 10671
d <- your_dput
# Build your subsetted dataframes
e <- d[grep("CLDV", d$OC_DV)[1]: grep("OPDV", d$OC_DV),]
f <- d[(grep("OPDV", d$OC_DV): grep("CLDV", d$OC_DV)[2]),]
# Make the diff() calls
diff(c(max(e$frspacing), min(f$frspacing)))
diff(c(max(f$frspacing), min(e$frspacing)))
My values are not quiet the same as yours, you can adjust the grep values manually depending on how you want to handle boundary inclusion/exclusion.
Upvotes: 1