umair durrani
umair durrani

Reputation: 6179

How to find max and min within sequence of values in a column in R?

This problem might be trivial but I am finding it difficult to solve it. Please guide me.

Data

Following is sample data:

structure(list(Vehicle.ID2 = c("39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25"
), OC_DV = c(".", ".", ".", ".", ".", "CLDV", ".", ".", ".", 
".", ".", ".", ".", ".", ".", "OPDV", ".", ".", ".", ".", ".", 
".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", 
".", "CLDV", ".", ".", "."), frspacing = c(35.83373, 35.75742, 
35.70391, 35.67694, 35.67792, 35.70669, 35.7619, 35.84096, 35.93962, 
36.05109, 36.16704, 36.28056, 36.3861, 36.47762, 36.5485, 36.59359, 
36.61402, 36.61791, 36.61383, 36.60651, 36.59694, 36.58372, 36.56525, 
36.54044, 36.50771, 36.46458, 36.40831, 36.33713, 36.25086, 36.15089, 
36.04004, 35.92236, 35.80322, 35.68935, 35.58883, 35.51032, 35.4618, 
35.4492, 35.47479)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-39L), .Names = c("Vehicle.ID2", "OC_DV", "frspacing"))  

What I want to do

I want to find the max and min of set of values in frspacing between the labels CLDV and OPDV in the column OC_DV. Then I want to find their difference.

Desired Output

Following are max and mins:

  Group      Max    Min
1 CLDV-OPDV 36.54   35.70
2 OPDV-CLDV 36.62   35.59  

Following are the absolute differences (max of 1st grp - min of 2nd group AND vice versa):

1 0.95
2 0.92

I don't have any code to show what I tried because honestly I don't know how to approach this problem. Obviously simple max or min by column won't work. I am using dplyr and didn't find anything relevant.

Upvotes: 2

Views: 1941

Answers (3)

Joseph Wood
Joseph Wood

Reputation: 7607

Below is a base R solution:

MaxMinSeq <- function(df) {
    myInd <- which(df$OC_DV != ".")
    myVals <- df$frspacing
    myTitles <- df$OC_DV[myInd]
    myLen <- length(myInd)-1L
    NewDf <- as.data.frame(t(sapply(1:myLen, function(x) {
               list(Group = paste(c(myTitles[x],"-",myTitles[x+1L]), collapse = ""),
                   Max = max(myVals[myInd[x]:(myInd[x+1L]-1L)]),
                   Min = min(myVals[myInd[x]:(myInd[x+1L]-1L)]))})))
    for (i in 1:3) {NewDf[,i] <- unlist(NewDf[,i])}
    NewDf
}

df2 <- MaxMinSeq(df)
df2
      Group      Max      Min
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883

This is a good bit faster than the dplyr solution posted above. Observe:

TestDplyr <- function(df) {
    df[df=="."] <- NA
    df$group <- paste((na.locf(df$OC_DV, na.rm = FALSE)), lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)), sep = "-")

    df$group2 <- NA
    df$group2[which(df$group != lag(df$group))] <- 1:length(which(df$group != lag(df$group)))
    df$group2 <- na.locf(df$group2, na.rm = FALSE)

    df %>% group_by(group, group2) %>% 
        summarise(Max = max(frspacing), Min = min(frspacing)) %>% 
        filter(!grepl("NA",group ))
}

microbenchmark(Joseph = MaxMinSeq(df), Cabana = TestDplyr(df))
Unit: microseconds
expr      min        lq      mean    median       uq      max neval
Joseph  338.671  377.6695  405.0257  405.9945  429.188  496.718   100
Cabana 2622.336 2698.2810 2890.5430 2765.6045 2977.427 7772.180   100

Here is a really big example:

myDfs <- lapply(1:10000, function(x) df)
bigDf <- do.call(rbind, myDfs)
bigDf$frspacing[40:nrow(bigDf)] <- runif((nrow(bigDf)-39), 10, 100)

a <- MaxMinSeq(bigDf)
b <- TestDplyr(bigDf)
b <- b[order(b$group2),]

identical(a$Max, b$Max)
[1] TRUE
identical(a$Min, b$Min)
[1] TRUE

system.time(TestDplyr(bigDf))
 user  system elapsed 
 1.54    0.00    1.54 
system.time(MaxMinSeq(bigDf))
 user  system elapsed 
  0.3     0.0     0.3

As for the second part of the question, I'm not sure how general the OP would like the answer to be especially when there are more than two different final pairings. For example does the OP want to find the max of one row and compare that to the min of the min of all rows, or do we simply compare neighbors? The function below takes the first approach (i.e. the general approach).

GetDiff <- function(df) {
    df2 <- cbind(df, t(sapply(1:nrow(df), function(x) {
                        c(rowMin = min(df[x,2:3]),
                          rowMax = max(df[x,2:3]))})))
    myRows <- 1:nrow(df)
    sapply(myRows, function(x) df2$rowMax[x] - min(df2$rowMin[-x]))
}

GetDiff(df2)   ## df2 comes from above
[1] 0.95967 0.91122

Upvotes: 1

Cabana
Cabana

Reputation: 419

 library(zoo) # for na.locf
 library(dplyr)

 df[df=="."] = NA
 df$group = paste((na.locf(df$OC_DV, na.rm = FALSE)), lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)), sep = "-")

 df %>% group_by(group) %>% 
   summarise(Max = max(frspacing), Min = min(frspacing)) %>% 
   filter(!grepl("NA",group ))

Source: local data frame [2 x 3]

      group      Max      Min
      (chr)    (dbl)    (dbl)
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883

With multiple values I would count the changes und use it as another grouping variable: (I duplicated the data in this example)

df$group2 = NA
df$group2[which(df$group != lag(df$group))] = 1:length(which(df$group != lag(df$group)))
df$group2 = na.locf(df$group2, na.rm = FALSE)

df %>% group_by(group, group2) %>% 
  summarise(Max = max(frspacing), Min = min(frspacing)) %>% 
   filter(!grepl("NA",group ))

Source: local data frame [5 x 4]
Groups: group [3]

      group group2      Max      Min
      (chr)  (int)    (dbl)    (dbl)
1 CLDV-CLDV      3 38.09082 34.30454
2 CLDV-OPDV      1 36.54850 35.70669
3 CLDV-OPDV      4 38.90356 34.08951
4 OPDV-CLDV      2 36.61791 35.58883
5 OPDV-CLDV      5 38.18983 34.27874

But if the combination of OC_DV is distinct in every Vehicle.ID2 you can simply paste the ID in group...

Upvotes: 4

Nate
Nate

Reputation: 10671

d <- your_dput
# Build your subsetted dataframes
e <- d[grep("CLDV", d$OC_DV)[1]: grep("OPDV", d$OC_DV),]
f <- d[(grep("OPDV", d$OC_DV): grep("CLDV", d$OC_DV)[2]),]
# Make the diff() calls
diff(c(max(e$frspacing), min(f$frspacing)))
diff(c(max(f$frspacing), min(e$frspacing)))

My values are not quiet the same as yours, you can adjust the grep values manually depending on how you want to handle boundary inclusion/exclusion.

Upvotes: 1

Related Questions