Reputation: 3805
First, some sample data:
location <- c("A","B","C","D","E")
mat <- as.data.frame(matrix(runif(1825),nrow=5,ncol=365))
t1<- c(258,265,306,355)
t2<- c(258,270,302,352)
t3<- c(258,275,310,353)
t4<- c(258,280,303,355)
t5<- c(258,285,312,356)
ts<-rbind(t1,t2,t3,t4,t5)
dat <-as.data.frame(cbind(location,mat,ts))
names(dat)[367:370] <- c("pl","vg","re","me")
location holds the names of the sites. V1 to V365 are the daily rainfall values (with V1 as the first day of the year). What I want to do is:
For each row (location), I want to produce three rainfall values based on the last four columns pl, vg, re, me (which specify days of the year).
For example, for location A, the last four columns are:
pl = 258
vg = 265
re = 306
me = 355
Therefore, for location A, I want to produce three rainfall values which are the sums of rainfall from:
V258 to V264
V265 to V305
and
V306 to V355
And do it for all five locations.
What I did was:
for(j in unique(dat$location)){
  loc <- dat[dat$location == j,]
  pl.val <- loc$pl + 1 # have to add + 1 since the rainfall starts from the second column
  vg.val <- loc$vg + 1
  re.val <- loc$re + 1
  me.val <- loc$me + 1
  rain1 <- sum(loc[,pl.val:vg.val])
  rain2 <- sum(loc[,(vg.val+ 1):re.val])
  rain3 <- sum(loc[,(re.val + 1):me.val])
}
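A quick way to check which rainfall columns those index ranges actually pick up is to look at the column names instead of the sums. The sketch below rebuilds a two-row version of the sample data (the seed and the variable names cols_loop, cols_stated and stated are my own additions for illustration). Under that setup, pl.val:vg.val in the loop covers V258 to V265, one column more than the V258 to V264 window described above, so the end points may need shifting by one.

```r
set.seed(42)  # assumption: seed added only to make the check reproducible
location <- c("A", "B")
mat <- as.data.frame(matrix(runif(2 * 365), nrow = 2, ncol = 365))
ts <- rbind(c(258, 265, 306, 355), c(258, 270, 302, 352))
dat <- cbind(data.frame(location), mat, ts)
names(dat)[367:370] <- c("pl", "vg", "re", "me")

loc <- dat[1, ]  # location A

# Column k + 1 holds Vk, because column 1 is `location`
cols_loop   <- names(loc)[(loc$pl + 1):(loc$vg + 1)]  # what the loop sums for rain1
cols_stated <- names(loc)[(loc$pl + 1):loc$vg]        # the window described in the text
range(cols_loop)
range(cols_stated)

# Sums over the windows as described in the text
stated <- c(
  rain1 = sum(loc[, (loc$pl + 1):loc$vg]),        # V258 to V264
  rain2 = sum(loc[, (loc$vg + 1):loc$re]),        # V265 to V305
  rain3 = sum(loc[, (loc$re + 1):(loc$me + 1)])   # V306 to V355
)
stated
```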
I want to avoid using a for loop and use the apply function instead. However, I am not familiar with how to use the apply function to do the calculation for all the rows (locations) in one go. Can anyone advise me how to go about it?
Thanks
EDIT
If one of the locations has NA for all its rainfall values and NA for the later dates (vg, re, me), how do I modify the code in the accepted answer below? Here's the sample data:
location <- c("A","B","C")
mat <- as.data.frame(matrix(runif(365*3),nrow=3,ncol=365))
t1<- c(258,265,306,355)
t2<- c(258,NA,NA,NA)
t3<- c(258,275,310,353)
ts<-rbind(t1,t2,t3)
dat <-as.data.frame(cbind(location,mat,ts))
names(dat)[367:370] <- c("pl","vg","re","me")
dat[2,-c( 367:370)] <- NA
Upvotes: 1
Views: 201
Reputation: 6776
I assume that you want speed.
I think the shape of your data makes the calculation awkward: only column 1 is character, columns 367:370 hold a different kind of value from the rest, and the table is very wide. Calculating row by row is probably not a good idea, because R is generally better at working column by column.
If I were you, I would reshape the data into the form below:
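library(tidyverse)
# One column per location; rows are days 1..365
# (as_tibble() replaces the deprecated as.tibble())
dat1 <- dat[, -c(1, 367:370)] %>%
  t() %>%
  as_tibble() %>%
  set_names(location)
# One column per location; rows are pl, vg, re, me
dat2 <- dat[, 367:370] %>%
  t() %>%
  as_tibble() %>%
  set_names(location)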
And I recommend map2() to process each pair of columns. .x is each column of dat1 and .y is each column of dat2 (both are treated as vectors). The code below is about fifty times as fast as yours.
map2(dat1, dat2, ~ {
  pl.val <- .y[1]
  vg.val <- .y[2]
  re.val <- .y[3]
  me.val <- .y[4]
  rain1 <- sum(.x[pl.val:vg.val])
  rain2 <- sum(.x[(vg.val+ 1):re.val])
  rain3 <- sum(.x[(re.val + 1):me.val])
  c(rain1 = rain1, rain2 = rain2, rain3 = rain3)
}
)
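To see how map2() pairs things up, here is a toy version with made-up vectors instead of the rainfall data. The names values, bounds and res are only for illustration: each element of values stands in for one dat1 column and each element of bounds for the matching dat2 column.

```r
library(purrr)

# Hypothetical stand-ins for dat1 (values) and dat2 (start/end indices)
values <- list(a = 1:5, b = 6:10)
bounds <- list(a = c(1, 3), b = c(2, 4))

# .x is one element of values, .y the element of bounds at the same position
res <- map2(values, bounds, ~ sum(.x[.y[1]:.y[2]]))

res$a  # 1 + 2 + 3 = 6
res$b  # 7 + 8 + 9 = 24
```

The result is a list that keeps the names of the first argument, which is why the real call returns one element per location.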
[additional (apply, mapply)]
Note: apply() has trouble with a data.frame that mixes character and numeric columns, because it converts the data to a matrix first. So if you use apply(), you need to drop the location column.
apply(dat[,-1], MARGIN = 1, function(x){
  pl.val <- x[367 - 1]
  vg.val <- x[368 - 1]
  re.val <- x[369 - 1]
  me.val <- x[370 - 1]
  rain1 <- sum(x[pl.val:vg.val])
  rain2 <- sum(x[(vg.val+ 1):re.val])
  rain3 <- sum(x[(re.val + 1):me.val])
  c(rain1 = rain1, rain2 = rain2, rain3 = rain3)
})
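The matrix conversion is easy to see on a tiny made-up data.frame (df, m_all, m_num and row_sums are illustrative names, not part of the question's data). With a character column present, every value becomes a string; once it is dropped, the rows stay numeric.

```r
# Hypothetical two-row data.frame with one character column
df <- data.frame(id = c("A", "B"), x = c(1.5, 2.5), y = c(10, 20))

m_all <- as.matrix(df)        # what apply() works on: a character matrix
m_num <- as.matrix(df[, -1])  # without the id column: a numeric matrix

is.character(m_all)  # TRUE
is.numeric(m_num)    # TRUE

# Row-wise arithmetic only works on the numeric part
row_sums <- apply(df[, -1], 1, sum)
row_sums
```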
mapply() works in much the same way as map2(). For this problem, mapply() was the fastest in my benchmark.
mapply(function(.x, .y){
  pl.val <- .y[1]
  vg.val <- .y[2]
  re.val <- .y[3]
  me.val <- .y[4]
  rain1 <- sum(.x[pl.val:vg.val])
  rain2 <- sum(.x[(vg.val+ 1):re.val])
  rain3 <- sum(.x[(re.val + 1):me.val])
  c(rain1 = rain1, rain2 = rain2, rain3 = rain3)
}, dat1, dat2)
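One practical difference from map2() is the shape of the result: when every call returns a vector of the same length, mapply() simplifies the output to a matrix with one column per input. A small sketch with made-up inputs (values, bounds and out are illustrative names only):

```r
# Hypothetical stand-ins for dat1 (values) and dat2 (start/end indices)
values <- list(a = 1:5, b = 6:10)
bounds <- list(a = c(1, 3), b = c(2, 4))

out <- mapply(function(.x, .y) {
  c(first = .x[.y[1]], total = sum(.x[.y[1]:.y[2]]))
}, values, bounds)

out     # 2 x 2 matrix: rows first/total, columns a/b
t(out)  # transpose to get one row per input instead
```

So for the rainfall data, t() on the mapply() result gives one row per location with columns rain1, rain2 and rain3.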
[benchmark]
Unit: microseconds
            expr       min        lq       mean     median        uq       max neval cld
forloop_method() 14154.075 15074.555 17110.4060 16588.1200 18416.387 25869.836   100   c
   map2_method()   205.586   234.263   325.8762   313.9395   333.633  2072.911   100 a
  apply_method()  1617.443  1684.812  1913.9187  1783.2480  1933.216  4189.687   100  b
 mapply_method()   154.972   185.079   213.9370   210.2300   225.978   468.690   100 a
[additional2 (error handling)]
The code below is almost as fast as the code above when there are no NAs. (Note: if each branch fits on one line, you can omit the {} of if(...) { A } else { B } and write if(...) A else B.)
results <- map2(dat1, dat2, ~ {
  pl.val <- .y[1]
  vg.val <- .y[2]
  re.val <- .y[3]
  me.val <- .y[4]
  # Return NA when either end of a window is missing; otherwise sum, ignoring NA rainfall
  rain1 <- if(is.na(pl.val) || is.na(vg.val)) NA else sum(.x[pl.val:vg.val], na.rm = TRUE)
  rain2 <- if(is.na(vg.val) || is.na(re.val)) NA else sum(.x[(vg.val+ 1):re.val], na.rm = TRUE)
  rain3 <- if(is.na(re.val) || is.na(me.val)) NA else sum(.x[(re.val + 1):me.val], na.rm = TRUE)
  c(rain1 = rain1, rain2 = rain2, rain3 = rain3)
}
)
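# If you want one row per location instead of a list
# (do.call() is used here because purrr's invoke() is deprecated;
#  the result is a matrix, wrap it in as.data.frame() if needed)
do.call(rbind, results)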
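The NA check can also be pulled out into a small helper, which makes it easy to try on a short vector before running it on the real data. This is only a sketch; window_sum and x are names I made up for the example.

```r
# Hypothetical helper: sum x[from:to], or NA when either bound is missing
window_sum <- function(x, from, to) {
  if (is.na(from) || is.na(to)) return(NA_real_)
  sum(x[from:to], na.rm = TRUE)
}

x <- c(1, 2, NA, 4, 5)

window_sum(x, 2, 4)   # 2 + 4 = 6, the NA inside the window is skipped
window_sum(x, 2, NA)  # NA, because the end of the window is unknown

# Caveat: with na.rm = TRUE a window that contains only NA sums to 0, not NA
window_sum(c(NA, NA, NA), 1, 3)
```

If an all-NA window should give NA rather than 0, an extra check such as all(is.na(x[from:to])) would be needed before summing.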
Upvotes: 1
Reputation: 2722
I wasn't sure how you wanted the rain sums returned. Are they to be bound as 3 new columns?
Basically, here's the code. I'll walk through it:
For each row in your dat data.frame, select the columns that hold the day markers, then build the start and end of each window from those values, stepping the end down by one so that each window stops just before the next one starts. Since we're now operating on one location's slice of the data, convert the values to numeric and sum the corresponding columns in the apply step. Use sprintf (see ?sprintf) to prepend a V to each column number from the sequence, and return the result as a list. I then simply named the list elements with the corresponding location's ID. If you wanted to append the sums to the data.frame instead, that would be simple as well.
library(dplyr)  # provides %>%, select() and matches()

lapply(1:nrow(dat), function(i){
  # day markers (pl, vg, re, me) for row i
  d_idx <- dat[i,] %>% dplyr::select(dplyr::matches("pl|vg|re|me"))
  # s = window starts, e = window ends (one day before the next start)
  a_idx <- data.frame(
    s = as.numeric(d_idx[,1:3]),
    e = c(as.numeric(d_idx[,2:3]) - 1, as.numeric(d_idx[[4]]))
  )
  as.list(apply(a_idx, 1, function(j){
    rowSums(dat[i, sprintf('V%s', seq(min(j),max(j)))])
  })) %>% setNames(sprintf('rain%s', 1:length(.)))
}) %>% setNames(dat$location)
$A
$A$rain1
[1] 2.391448
$A$rain2
[1] 21.58306
$A$rain3
[1] 27.805
$B
$B$rain1
[1] 5.339885
$B$rain2
[1] 16.57476
$B$rain3
[1] 26.37708
$C
$C$rain1
[1] 7.929777
$C$rain2
[1] 17.81324
$C$rain3
[1] 20.12217
$D
$D$rain1
[1] 9.715258
$D$rain2
[1] 11.2547
$D$rain3
[1] 25.93332
$E
$E$rain1
[1] 12.81343
$E$rain2
[1] 15.41595
$E$rain3
[1] 21.79217
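To make the index step concrete, here is what the start/end table and the generated column names look like for location A's day markers. This is a standalone sketch; days, bounds and cols are names chosen for the illustration.

```r
# Day markers for location A: pl, vg, re, me
days <- c(258, 265, 306, 355)

# Each window starts at a marker and ends one day before the next marker,
# except the last window, which ends on me itself
bounds <- data.frame(
  s = days[1:3],
  e = c(days[2:3] - 1, days[4])
)
bounds

# Column names for the first window
cols <- sprintf("V%s", seq(bounds$s[1], bounds$e[1]))
cols  # "V258" ... "V264"
```

Selecting by name like this avoids the off-by-one that comes from the location column sitting in position 1.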
Upvotes: 1