liguang
liguang

Reputation: 161

Remove Nested LOOP

I have to remove loops of my function. This is not easy because I have difficult structure of data and I don't know how can use apply family in it.

First of all, I have this structure of data

List <- List DATA 1 <- List DATA 2

Inside it this lists, I have other lists with TRAIN and TEST. Finally, I have data.frames in theses levels. I create simul data of my list with iris dataset.

data(iris)
head(iris)

iristest<-head(iris)

train<-list(iris,iris,iris)
test<-list(iristest,iristest,iristest)

list1<-list(train,test)
names(list1)<-c("train","test")


iris2<-iris
iris2[,1:4]<-iris[,1:4]+5
iristest2<-head(iris2)

train<-list(iris2,iris2,iris2)
test<-list(iristest2,iristest2,iristest2)

list2<-list(train,test)
names(list2)<-c("train","test")

flist<-list(list1,list2)
names(flist)<-c("iris","iris2")

Now, I created a function than i want to apply of my list.

Kmax<-5
nd<-10
ks<-seq(from=1,to=Kmax,by=1)
kn<-seq(1:nd)

findKNN<-function(listdf,seeds){
  indx<-1

  outs<-matrix(0, nrow = 5*length(listdf[[1]]), ncol = 3)

  for (i in seq_along(listdf[[1]])){
    for (K in 1:5){
      train<- as.data.frame(listdf$train[i])
      test <- as.data.frame(listdf$test[i])

      set.seed(seeds)

      kpreds <- knn(train[,-ncol(train)],test[,-ncol(test)], train[,ncol(train)],k=K)
      Ktable <-table(kpreds ,test[,ncol(test)])

      outs[indx,1] <- (Ktable[1, 2] + Ktable[2, 1]) / sum(Ktable)
      outs[indx,2] <- K
      outs[indx,3] <- i
      indx<-indx+1
    }
  }

  outs<-data.frame(outs)
  names(outs)<-c("error","K","I")
  outs<-aggregate(error ~ K,outs, mean)
}

output<-lapply(flist,seeds=12345,findKNN)

But I dont know how I can run this code effienctly.

Thanks

Upvotes: 0

Views: 500

Answers (3)

Aaron - mostly inactive
Aaron - mostly inactive

Reputation: 37814

The place to start is by factoring your code into chunks, where each new function works on each level of the data. Then you can call each piece from the other and collect the results in a more idiomatic way.

Here I made functions for 1) the core code for each train/test pair, 2) repeating that for each desired K, and 3) repeating that across the possible pairs.

I agree with @Deja that restructuring your data to a more "tidyverse" style method could result in even more intuitive code, but if you're not used to thinking in that way, this is probably clearer.

## run core code for a particular train/test pair
run1 <- function(train, test, K, seeds) {
  set.seed(seeds)  
  train <- as.data.frame(train)
  test <- as.data.frame(test)
  kpreds <- class::knn(train[, -ncol(train)],test[,-ncol(test)], train[,ncol(train)],k=K)
  Ktable <- table(kpreds ,test[, ncol(test)])
  (Ktable[1, 2] + Ktable[2, 1]) / sum(Ktable)
}

## run a particular train/test pair at several values of K
runK <- function(train, test, Ks, seeds) {
  errors <- sapply(Ks, function(K) run1(train, test, K, seeds))
  data.frame(K=Ks, error=errors)
}

## test several train/test pairs, at several values of K
findKNN <- function(df, Ks=1:5, seeds){
  stopifnot(length(df$train)==length(df$test))
  out <- lapply(seq_along(df$train), function(i) {
    cbind(i=i, runK(df$train[[i]], df$test[[i]], Ks, seeds))
  })
  out <- do.call(rbind, out)
  aggregate(error ~ K, out, mean)
}

## loop over several sets of data
output <- lapply(flist, seeds=12345, findKNN)

To put the data in a more "tidy" format, you'd have one row per test/train pair with additional columns for which data set and which rep it is. A little awkward to get there from what you started with, but here's what it would look like.

n <- sapply(lapply(flist, `[[`, "train"), length)
ftrain <- do.call(c, lapply(flist, `[[`, "train"))
ftest <- do.call(c, lapply(flist, `[[`, "test"))
nn <- rep(names(n), n)
ii <- unlist(lapply(n, function(i) seq_len(i)))
library(tidyverse)
alld <- tibble(data=nn, i=ii, train=ftrain, test=ftest)
alld
## # A tibble: 6 x 4
##   data      i train                  test                
##   <chr> <int> <list>                 <list>              
## 1 iris      1 <data.frame [150 x 5]> <data.frame [6 x 5]>
## 2 iris      2 <data.frame [150 x 5]> <data.frame [6 x 5]>
## 3 iris      3 <data.frame [150 x 5]> <data.frame [6 x 5]>
## 4 iris2     1 <data.frame [150 x 5]> <data.frame [6 x 5]>
## 5 iris2     2 <data.frame [150 x 5]> <data.frame [6 x 5]>
## 6 iris2     3 <data.frame [150 x 5]> <data.frame [6 x 5]>

You'd then loop through each row. (Note to make this work I had to make the result of runK be a data.frame.)

out <- alld %>% mutate(error=map2(train, test, runK, Ks=1:5, seeds=12345))
out
## # A tibble: 6 x 5
##   data      i train                  test                 error               
##   <chr> <int> <list>                 <list>               <list>              
## 1 iris      1 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>
## 2 iris      2 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>
## 3 iris      3 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>
## 4 iris2     1 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>
## 5 iris2     2 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>
## 6 iris2     3 <data.frame [150 x 5]> <data.frame [6 x 5]> <data.frame [5 x 2]>

You then take out the original data, "unnest" the error data.frame, and summarize over data set and K.

out %>% select(-train, -test) %>% unnest() %>% 
  group_by(data, K) %>% summarize(error=mean(error))
## # A tibble: 10 x 3
## # Groups:   data [?]
##    data      K error
##    <chr> <int> <dbl>
##  1 iris      1     0
##  2 iris      2     0
##  3 iris      3     0
##  4 iris      4     0
##  5 iris      5     0
##  6 iris2     1     0
##  7 iris2     2     0
##  8 iris2     3     0
##  9 iris2     4     0
## 10 iris2     5     0

Upvotes: 1

Jet
Jet

Reputation: 690

The apply functions actually do not have efficiency advantage over the for loops anymore, according to this thread.

If your goal is only to decrease runtime, then there may be no point converting the loops to apply functions. The advantage of these functions is now mainly to produce more readable code.

Upvotes: 1

Deja
Deja

Reputation: 35

This is just a stab in the dark but it seems to me the reason for the two loops is that you have structured the data as lists inside a list? Possibly lists inside lists inside a list? To me this seems to be the bigger issue then the for loops not being efficient.

Just an idea, but maybe restructure how your data is stored to something like a map where you can relate values to keys. So example is you have a map with keys “list1” “list2” and all values in map are paired to their key. Then you only need one for loop with an if that says if keys match what i want take data. Just a thought.

Upvotes: 1

Related Questions