89_Simple

Reputation: 3805

R: using apply family instead of for-loops for data frame

First, some sample data:

location <- c("A","B","C","D","E")
mat <- as.data.frame(matrix(runif(1825),nrow=5,ncol=365))
t1<- c(258,265,306,355)
t2<- c(258,270,302,352)
t3<- c(258,275,310,353)
t4<- c(258,280,303,355)
t5<- c(258,285,312,356)
ts<-rbind(t1,t2,t3,t4,t5)
dat <-as.data.frame(cbind(location,mat,ts))
names(dat)[367:370] <- c("pl","vg","re","me")

location holds the names of the sites. V1 to V365 are the daily rainfall values (with V1 as the first day of the year). What I want to do is:

For each row (location), I want to produce three rainfall values based on the last four columns pl, vg, re, me (which specify days of the year).

For example, for location A, the last four columns are:

pl = 258 vg = 265 re = 306 me= 355

Therefore, for location A, I want to produce three rainfall values which are sum of rainfall from:

V258 to V264

V265 to V305 and

V306 to V355

And do it for all the five locations.

What I did was:

for(j in unique(dat$location)){

   loc <- dat[dat$location == j,]

   pl.val <- loc$pl + 1 # have to add + 1 since the rainfall starts from the second column
   vg.val <- loc$vg + 1
   re.val <- loc$re + 1
   me.val <- loc$me + 1

   rain1 <- sum(loc[,pl.val:vg.val])
   rain2 <- sum(loc[,(vg.val+ 1):re.val])
   rain3 <- sum(loc[,(re.val + 1):me.val])
}

I want to avoid the for loop and use one of the apply functions instead. However, I am not familiar with how to use apply to do the calculation for all the rows (locations) in one go. Can anyone advise me how to go about it?

Thanks

EDIT

If one of the locations has NA for all of its rainfall values and NA for the other dates, how do I modify the code that is accepted as the answer below? Here's the sample data:

location <- c("A","B","C")
mat <- as.data.frame(matrix(runif(365*3),nrow=3,ncol=365))
t1<- c(258,265,306,355)
t2<- c(258,NA,NA,NA)
t3<- c(258,275,310,353)
ts<-rbind(t1,t2,t3)
dat <-as.data.frame(cbind(location,mat,ts))
names(dat)[367:370] <- c("pl","vg","re","me")
dat[2,-c( 367:370)] <- NA

Upvotes: 1

Views: 201

Answers (2)

cuttlefish44

Reputation: 6776

I assume that you want speed.

I don't think the layout of your data is well suited to this calculation: only column 1 is character, columns 367:370 hold a different kind of value (day indices rather than rainfall), and the table is very wide. Calculating row by row is probably not a good idea, because R generally works best column by column.

If I were you, I would reshape the data like this:

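library(tidyverse)

# Rainfall only: one column per location, one row per day
dat1 <- dat[, -c(1, 367:370)] %>% 
  t() %>% 
  as_tibble() %>%   # as.tibble() is deprecated in favour of as_tibble()
  set_names(location)

# Day indices only: one column per location, rows are pl, vg, re, me
dat2 <- dat[, 367:370] %>% 
  t() %>% 
  as_tibble() %>% 
  set_names(location)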

I recommend map2() to process each pair of columns: .x is a column of dat1 and .y is the matching column of dat2 (both are treated as vectors). The code below is about fifty times as fast as yours (see the benchmark further down).

map2(dat1, dat2, ~ {
  pl.val <- .y[1]
  vg.val <- .y[2]
  re.val <- .y[3]
  me.val <- .y[4]

  rain1 <- sum(.x[pl.val:vg.val]) 
  rain2 <- sum(.x[(vg.val+ 1):re.val]) 
  rain3 <- sum(.x[(re.val + 1):me.val]) 
  c(rain1 = rain1, rain2 = rain2, rain3 = rain3)
  }
)
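
One thing to check: the question text describes the windows as V258 to V264, V265 to V305 and V306 to V355 (each window stops the day before the next stage starts), while the code above follows the question's loop and sums pl to vg, vg + 1 to re, and re + 1 to me. If the textual definition is the intended one, a minimal sketch could look like the following. `window_sums` is a hypothetical helper name, and the toy rainfall vector is made up so that the result is easy to verify by hand.

```r
# Hypothetical helper: sums rainfall over the three windows as described in
# the question text (pl to vg - 1, vg to re - 1, re to me).
window_sums <- function(rain, days) {
  pl <- days[1]; vg <- days[2]; re <- days[3]; me <- days[4]
  c(rain1 = sum(rain[pl:(vg - 1)]),
    rain2 = sum(rain[vg:(re - 1)]),
    rain3 = sum(rain[re:me]))
}

# Toy data: 1 unit of rain on every day, so each sum equals the window length.
rain <- rep(1, 365)
days <- c(258, 265, 306, 355)
res <- window_sums(rain, days)
print(res)
```

With the reshaped data, the same helper should slot into the column-wise call, for example `map2(dat1, dat2, window_sums)`.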


[additional (apply, mapply)]

Note: apply() converts its input to a matrix, so a data.frame that mixes character and numeric columns becomes a character matrix. To use apply() here, the location column has to be dropped first.

apply(dat[,-1], MARGIN = 1, function(x){
  pl.val <- x[367 - 1]
  vg.val <- x[368 - 1]
  re.val <- x[369 - 1]
  me.val <- x[370 - 1]

  rain1 <- sum(x[pl.val:vg.val]) 
  rain2 <- sum(x[(vg.val+ 1):re.val]) 
  rain3 <- sum(x[(re.val + 1):me.val]) 
  c(rain1 = rain1, rain2 = rain2, rain3 = rain3)
})
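
To see why the location column has to go, here is a small sketch of the coercion that happens when a data.frame is turned into a matrix. The `toy` data frame is made up for illustration and is not part of the question's data.

```r
# Small made-up data frame mixing a character column with numeric ones.
toy <- data.frame(location = c("A", "B"),
                  V1 = c(0.5, 1.5),
                  V2 = c(2.0, 3.0))

# apply() converts a data.frame with as.matrix(); mixed types become character.
mixed <- as.matrix(toy)
numeric_only <- as.matrix(toy[, -1])

print(typeof(mixed))
print(typeof(numeric_only))

# With the character column dropped, row-wise arithmetic works as expected.
row_totals <- apply(toy[, -1], 1, sum)
print(row_totals)
```

On the full `toy` data frame, `sum()` inside apply() would fail because every value has become a string.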

mapply() works almost the same way as map2(). For this question, mapply() gives the best performance.

mapply(function(.x, .y){
  pl.val <- .y[1]
  vg.val <- .y[2]
  re.val <- .y[3]
  me.val <- .y[4]

  rain1 <- sum(.x[pl.val:vg.val]) 
  rain2 <- sum(.x[(vg.val+ 1):re.val]) 
  rain3 <- sum(.x[(re.val + 1):me.val]) 
  c(rain1 = rain1, rain2 = rain2, rain3 = rain3)
  }, dat1, dat2)

[benchmark]

Unit: microseconds
             expr       min        lq       mean     median        uq       max neval cld
 forloop_method() 14154.075 15074.555 17110.4060 16588.1200 18416.387 25869.836   100   c
    map2_method()   205.586   234.263   325.8762   313.9395   333.633  2072.911   100 a  
   apply_method()  1617.443  1684.812  1913.9187  1783.2480  1933.216  4189.687   100  b 
  mapply_method()   154.972   185.079   213.9370   210.2300   225.978   468.690   100 a  


[additional2 (error handling)]

The code below is almost as fast as the code above when there are no NAs. (Note: if each branch fits on one line, you can omit the {} of if(...) { A } else { B } and write if(...) A else B.)

results <- map2(dat1, dat2, ~ {
  pl.val <- .y[1]
  vg.val <- .y[2]
  re.val <- .y[3]
  me.val <- .y[4]

  # || is the scalar "or"; TRUE is safer than T, which can be reassigned
  rain1 <- if(is.na(pl.val) || is.na(vg.val)) NA else sum(.x[pl.val:vg.val], na.rm = TRUE)
  rain2 <- if(is.na(vg.val) || is.na(re.val)) NA else sum(.x[(vg.val+ 1):re.val], na.rm = TRUE)
  rain3 <- if(is.na(re.val) || is.na(me.val)) NA else sum(.x[(re.val + 1):me.val], na.rm = TRUE)
  c(rain1 = rain1, rain2 = rain2, rain3 = rain3)
  }
)

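# If you want one table instead of a list: rbind() gives a matrix,
# so wrap it in as.data.frame() for a data.frame.
# (do.call() is the base R equivalent of purrr's deprecated invoke().)
as.data.frame(do.call(rbind, results))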
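
The NA check has to come before the indexing because the colon operator cannot build a sequence from a missing endpoint. The sketch below shows that failure and wraps the same guard in a small function. `safe_sum` is a hypothetical helper name and the rainfall values are made up.

```r
# A missing endpoint makes `:` fail, which is why the NA check comes first.
colon_result <- tryCatch(258:NA, error = function(e) conditionMessage(e))
print(colon_result)

# Hypothetical helper wrapping the same guard as the code above.
safe_sum <- function(x, from, to) {
  if (is.na(from) || is.na(to)) NA_real_ else sum(x[from:to], na.rm = TRUE)
}

rain <- c(1, 2, NA, 4, 5)      # made-up values with one missing day
inside <- safe_sum(rain, 2, 4) # the NA inside the window is skipped
missing_end <- safe_sum(rain, 2, NA)
print(inside)
print(missing_end)
```

Keep in mind that `sum(..., na.rm = TRUE)` returns 0 when every value in the window is NA, so a location with valid day indices but no rainfall data would report 0 rather than NA.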

Upvotes: 1

Carl Boneri

Reputation: 2722

I wasn't sure how you wanted the rain sums returned. Are they to be bound as 3 new columns?

Basically, here's the code; I'll walk through it. For each row of your dat data.frame, select the columns that hold the days, then build start and end indices from those values, stepping each end down by one from the next start so that we pick the correct columns each time. Since we are now operating on one location's slice of the data, convert the values to numeric and sum the corresponding columns in the apply step. sprintf() (see ?sprintf) prepends a V to each column number from the sequence, and the sums are returned as a list. I then simply named the list elements with the corresponding location's ID. If you wanted to append the result to the data.frame, that would be simple as well.

library(dplyr)  # provides %>%; select() and matches() are called with dplyr:: below

lapply(1:nrow(dat), function(i){
    d_idx <- dat[i,] %>% dplyr::select(dplyr::matches("pl|vg|re|me"))
    a_idx <- data.frame(
        s = as.numeric(d_idx[,1:3]), 
        e = c(as.numeric(d_idx[,2:3]) - 1, as.numeric(d_idx[[4]]))
    )
    as.list(apply(a_idx, 1, function(j){
        rowSums(dat[i, sprintf('V%s', seq(min(j),max(j)))])
    })) %>% setNames(sprintf('rain%s', 1:length(.)))
}) %>% setNames(dat$location)


$A
$A$rain1
[1] 2.391448

$A$rain2
[1] 21.58306

$A$rain3
[1] 27.805


$B
$B$rain1
[1] 5.339885

$B$rain2
[1] 16.57476

$B$rain3
[1] 26.37708


$C
$C$rain1
[1] 7.929777

$C$rain2
[1] 17.81324

$C$rain3
[1] 20.12217


$D
$D$rain1
[1] 9.715258

$D$rain2
[1] 11.2547

$D$rain3
[1] 25.93332


$E
$E$rain1
[1] 12.81343

$E$rain2
[1] 15.41595

$E$rain3
[1] 21.79217
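
If the three sums should end up as new columns, one possible way is to flatten the nested list into a matrix and bind it on by location name. This is only a sketch: `toy_dat` and `res` are made-up stand-ins for `dat` and for the list printed above.

```r
# Made-up stand-ins: a tiny data frame and a nested list shaped like the
# output shown above (one element per location, each holding rain1..rain3).
toy_dat <- data.frame(location = c("A", "B"), stringsAsFactors = FALSE)
res <- list(
  A = list(rain1 = 1.5, rain2 = 2.5, rain3 = 3.5),
  B = list(rain1 = 4.0, rain2 = 5.0, rain3 = 6.0)
)

# unlist() each location into a named numeric vector, then stack the rows.
rain_mat <- do.call(rbind, lapply(res, unlist))

# Index rows by location name before binding, in case the order differs.
toy_dat <- cbind(toy_dat, rain_mat[toy_dat$location, , drop = FALSE])
print(toy_dat)
```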

Upvotes: 1
