Joshua
Joshua

Reputation: 752

R: rewrite loop with apply

I have the following type of data set:

id;2011_01;2011_02;2011_03; ... ;2001_12
id01;NA;NA;123; ... ;NA
id02;188;NA;NA; ... ;NA

That is, each row is unique customer and each column depicts a trait for this customer from the past 10 years (each month has its own column). The thing is that I want to condense this 120 column data frame into a 10 column data frame, this because I know that almost all rows have (although the month itself can vary) have 1 or 0 observations from each year.

I've already done, one year at the time, this using a loop with a nested if-clause:

for(i in 1:nrow(input_data)) {
    temp_row <- input_data[i,c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
    loc2011 <- which(!is.na(temp_row))
    if(length(loc2011 ) > 0) {
        temp_row_2011[i,] <- temp_row[loc2011[1]] #pick the first observation if there are several
    } else {
        temp_row_2011[i,] <- NA
    }
}

Since my data set is quite big, and I need to perform the above loop 10 times (one for each year), this is taking way too much time. I know one is much better of using apply commands in R, so I would greatly appreciate help on this task. How could I write the whole thing (including the different years) better?

Upvotes: 1

Views: 850

Answers (2)

tim riffe
tim riffe

Reputation: 5691

Are you after something like this?:

    temp_row_2011 <- apply(input_data, 1, function(x){
        temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
        temp_row[!is.na(temp_row)][1]
    })

If this gives you the right output, and if it runs faster than your loop, then it's not necessarily due only to the fact of using an apply(), but also because it assigns less stuff and avoids an if {} else {}. You might be able to make it go even faster by compiling the anonymous function:

    reduceyear <- function(x){
        temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
        temp_row[!is.na(temp_row)][1]
    }
    # compile, just in case it runs faster:
    reduceyear_c <- compiler:::cmpfun(reduceyear)
    # this ought to do the same as the above.
    temp_row_2011 <- apply(input_data, 1, reduceyear_c)

You didn't say whether input_data is a data.frame or a matrix, but a matrix would be faster than the former (but only valid if input_data is all the same class of data).

[EDIT: full example, motivated by DWin]

    input_data <- matrix(ncol=24,nrow=10)
    # years and months:
    colnames(input_data) <- c(paste(2010,1:12,sep="_"),paste(2011,1:12,sep="_"))
    # some ids
    rownames(input_data) <- 1:10 
    # put in some values:
    input_data[sample(1:length(input_data),200,replace=FALSE)] <- round(runif(200,100,200))
    # make an all-NA case:
    input_data[2,1:12] <- NA

    # and here's the full deal:
    sapply(2010:2011, function(x,input_data){
        input_data_yr <- input_data[, grep(x, colnames(input_data) )] 
        apply(input_data_yr, 1, function(id){
                    id[!is.na(id)][1]
                }
        )
    }, input_data)

All NA case works. grep() column selection idea lifted from DWin. As in the above example, you could actually define the anonymous interior function and compile it to potentially make the thing run faster.

Upvotes: 3

IRTFM
IRTFM

Reputation: 263352

I built a tiny test case (for which timriffe's suggestion fails). You might attract more interest by putting up code that creates a more complete test case such as 4 quarters for 2 years and including pathological cases such as all NA's in one row of one year. I would think that instead of requiring you to write out all the year columns by name, that you ought to cycle through them with a grep() strategy:

  # funyear <- function to work on one year's data and return a single vector
  # my efforts keep failing on the all(NA) row by year combos
  sapply(seq("2011", "2001"), function (pat) funyear(input_data[grep(pat, names(input_data) )] )

Upvotes: 1

Related Questions