Vasilis Vasileiou
Vasilis Vasileiou

Reputation: 527

Function that subsets a dataframe to the "best square" in R

My objective is to write a function in R that takes a dataframe as input and returns the "best square subset" of it.

By best square subset I mean that the output needs to confirm the below:

Let's take the three following examples:

example1 <- structure(list(Afternoon = c(20800L, 15254L, 17426L, 4391L, 39194L
), Evening = c(21679L, 0L, 2973L, 37L, 435L), Morning = c(0L, 
                                                          3726L, 0L, 0L, 0L)), .Names = c("Afternoon", "Evening", "Morning"
                                                          ), row.names = c("Friday", "Monday", "Thursday", "Tuesday", "Wednesday"
                                                          ), class = "data.frame")

example2 <- structure(list(Afternoon = c(1227364L, 219402L, 3L, 0L, 530891L, 
                                         153124L, 281788L), Evening = c(570618L, 167216L, 31L, 10L, 88702L, 
                                                                        161006L, 42L), Morning = c(0L, 121775L, 0L, 0L, 0L, 25133L, 270162L
                                                                        )), .Names = c("Afternoon", "Evening", "Morning"), row.names = c("Friday", 
                                                                                                                                         "Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"
                                                                        ), class = "data.frame")

example3 <- structure(list(Afternoon = c(20800L, 258L, 300L, 563L, 2000L
), Evening = c(21679L, 0L, 2973L, 37L, 435L), Morning = c(0L, 
                                                          3726L, 0L, 0L, 0L)), .Names = c("Afternoon", "Evening", "Morning"
                                                          ), row.names = c("Friday", "Monday", "Thursday", "Tuesday", "Wednesday"
                                                          ), class = "data.frame")

That look like this:

> example1
          Afternoon Evening Morning
Friday        20800   21679       0
Monday        15254       0    3726
Thursday      17426    2973       0
Tuesday        4391      37       0
Wednesday     39194     435       0

> example2
          Afternoon Evening Morning
Friday      1227364  570618       0
Monday       219402  167216  121775
Saturday          3      31       0
Sunday            0      10       0
Thursday     530891   88702       0
Tuesday      153124  161006   25133
Wednesday    281788      42  270162

> example3
          Afternoon Evening Morning
Friday        20800   21679       0
Monday          258       0    3726
Thursday        300    2973       0
Tuesday         563      37       0
Wednesday      2000     435       0

The function I'm looking for should subset the above 3 examples to the following 3 respectively:

> output1
          Afternoon
Friday        20800
Monday        15254
Thursday      17426
Tuesday        4391
Wednesday     39194

The score/area of the square is 5. Anything else would be less. For instance, selecting Friday,Thursday Afternoon evening would yield a score of 4

> output2
         Afternoon Evening
Friday     1227364  570618
Monday      219402  167216
Thursday    530891   88702
Tuesday     153124  161006

Here, someone's first thought would be to select all Monday, Tuesday & wednesday times and get a score of 9. However, Wendesday evenings's 42 validates the first criterion. Monday & Tuesday all days and times would yield a score/area of 6

> output3
       Afternoon Evening
Friday     20800   21679

Here we have two possible squares that validate the first 2 criteria: Friday afternoon and evening or Friday and Wednesday afternoon. We have to go with the first choice as the sum inside the cells is higher than in the second case. This rule is applied only in the case of ties.

Upvotes: 0

Views: 132

Answers (1)

Meysam Torkaman
Meysam Torkaman

Reputation: 56

The most intuitive solution would be to go through all possible combinations of the rows and columns and check whether the selected rows and columns form a full square or not, if yes, then check if that form the largest number possible. A potential problem with this approach would be that if you have many columns and rows, it will take ages to finish which is not optimal. My answer to the question which comes here work reasonably well (on my PC with 16GB RAM, 2.7 GHz CPU, and Windows 10pro 64 bit, R version 3.5.1) when number of columns and rows are not very larger than 12.

#library(gtools)

find_best_square <- function(x, thresh = 2000){
    # x <- example1
    x[x<thresh] <- 0

    # for larger datasets only: removing lonely cells
    if (ncol(x) > 7 | nrow(x)> 7){
        for (i in 1:nrow(x)){
            for (j in 1:ncol(x)){
                if((colSums(x[,j,drop=F]) == x[i,j]) & (rowSums(x[i,,drop=F])==x[i,j])) x[i, j] <- 0L 
            }
        }
    }

    # remove columns with no data
    is_colZero <- colSums(x==0)== nrow(x)
    if(any(is_colZero)) print(paste('this column is empty and removed: ', which(is_colZero)))
    x <- x[,!is_colZero]

    # remove rows with no data
    is_rowZero <- rowSums(x==0)==ncol(x)
    if(any(is_rowZero)) print(paste('this row is empty and removed: ', which(is_rowZero)))
    x <- x[!is_rowZero,]

    n <- ncol(x)
    m <- nrow(x)
    max_size <- 0L
    max_sum <- 0L
    jump_i <- 0L
    jump_j <- 0L

    for (i in n:1){ # cols
        # all possible combination
        next_max <- m  * (i-1)

        if(max_size!=0 & next_max < max_size &  i * m < max_size) {
            jump_i <- jump_i + 1
            next()
        }
        comb_col <- combinations(n,i)
        for (k in 1:nrow(comb_col)){
            col <- as.integer(comb_col[k,])
            for(j in m:1){ # rows
                if (i*j < max_size ) {
                    jump_j <- jump_j +1
                    next()
                }
                comb_row <- combinations(m,j)
                for (l in 1:nrow(comb_row)){
                    row <- as.integer(comb_row[l,])
                    y <- x[row, col, drop=F]
                    if(all(y > 0) & max_size <= length(row)*length(col)){
                        if(max_size == length(row)*length(col)){
                            if(sum(y) > max_sum){ 
                                max_size <- length(row) * length(col)
                                max_cols <- col
                                max_rows <- row
                                max_sum <- sum(y)}
                        } else {
                            max_size <- length(row) * length(col)
                            max_cols <- col
                            max_rows <- row
                            max_sum <- sum(y) 
                        }

                    }

                }
            }

        }


    }   
    return(x[max_rows,max_cols, drop=F])
}

hope this will work for you, any question please email me.

Upvotes: 2

Related Questions