Reputation: 527
My objective is to write a function in R that takes a dataframe as input and returns the "best square subset" of it.
By best square subset I mean that the output needs to confirm the below:
Let's take the three following examples:
example1 <- structure(list(Afternoon = c(20800L, 15254L, 17426L, 4391L, 39194L
), Evening = c(21679L, 0L, 2973L, 37L, 435L), Morning = c(0L,
3726L, 0L, 0L, 0L)), .Names = c("Afternoon", "Evening", "Morning"
), row.names = c("Friday", "Monday", "Thursday", "Tuesday", "Wednesday"
), class = "data.frame")
example2 <- structure(list(Afternoon = c(1227364L, 219402L, 3L, 0L, 530891L,
153124L, 281788L), Evening = c(570618L, 167216L, 31L, 10L, 88702L,
161006L, 42L), Morning = c(0L, 121775L, 0L, 0L, 0L, 25133L, 270162L
)), .Names = c("Afternoon", "Evening", "Morning"), row.names = c("Friday",
"Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"
), class = "data.frame")
example3 <- structure(list(Afternoon = c(20800L, 258L, 300L, 563L, 2000L
), Evening = c(21679L, 0L, 2973L, 37L, 435L), Morning = c(0L,
3726L, 0L, 0L, 0L)), .Names = c("Afternoon", "Evening", "Morning"
), row.names = c("Friday", "Monday", "Thursday", "Tuesday", "Wednesday"
), class = "data.frame")
That look like this:
> example1
Afternoon Evening Morning
Friday 20800 21679 0
Monday 15254 0 3726
Thursday 17426 2973 0
Tuesday 4391 37 0
Wednesday 39194 435 0
> example2
Afternoon Evening Morning
Friday 1227364 570618 0
Monday 219402 167216 121775
Saturday 3 31 0
Sunday 0 10 0
Thursday 530891 88702 0
Tuesday 153124 161006 25133
Wednesday 281788 42 270162
> example3
Afternoon Evening Morning
Friday 20800 21679 0
Monday 258 0 3726
Thursday 300 2973 0
Tuesday 563 37 0
Wednesday 2000 435 0
The function I'm looking for should subset the above 3 examples to the following 3 respectively:
> output1
Afternoon
Friday 20800
Monday 15254
Thursday 17426
Tuesday 4391
Wednesday 39194
The score/area of the square is 5. Anything else would be less. For instance, selecting Friday,Thursday Afternoon evening would yield a score of 4
> output2
Afternoon Evening
Friday 1227364 570618
Monday 219402 167216
Thursday 530891 88702
Tuesday 153124 161006
Here, someone's first thought would be to select all Monday, Tuesday & wednesday times and get a score of 9. However, Wendesday evenings's 42 validates the first criterion. Monday & Tuesday all days and times would yield a score/area of 6
> output3
Afternoon Evening
Friday 20800 21679
Here we have two possible squares that validate the first 2 criteria: Friday afternoon and evening or Friday and Wednesday afternoon. We have to go with the first choice as the sum inside the cells is higher than in the second case. This rule is applied only in the case of ties.
Upvotes: 0
Views: 132
Reputation: 56
The most intuitive solution would be to go through all possible combinations of the rows and columns and check whether the selected rows and columns form a full square or not, if yes, then check if that form the largest number possible. A potential problem with this approach would be that if you have many columns and rows, it will take ages to finish which is not optimal. My answer to the question which comes here work reasonably well (on my PC with 16GB RAM, 2.7 GHz CPU, and Windows 10pro 64 bit, R version 3.5.1) when number of columns and rows are not very larger than 12.
#library(gtools)
find_best_square <- function(x, thresh = 2000){
# x <- example1
x[x<thresh] <- 0
# for larger datasets only: removing lonely cells
if (ncol(x) > 7 | nrow(x)> 7){
for (i in 1:nrow(x)){
for (j in 1:ncol(x)){
if((colSums(x[,j,drop=F]) == x[i,j]) & (rowSums(x[i,,drop=F])==x[i,j])) x[i, j] <- 0L
}
}
}
# remove columns with no data
is_colZero <- colSums(x==0)== nrow(x)
if(any(is_colZero)) print(paste('this column is empty and removed: ', which(is_colZero)))
x <- x[,!is_colZero]
# remove rows with no data
is_rowZero <- rowSums(x==0)==ncol(x)
if(any(is_rowZero)) print(paste('this row is empty and removed: ', which(is_rowZero)))
x <- x[!is_rowZero,]
n <- ncol(x)
m <- nrow(x)
max_size <- 0L
max_sum <- 0L
jump_i <- 0L
jump_j <- 0L
for (i in n:1){ # cols
# all possible combination
next_max <- m * (i-1)
if(max_size!=0 & next_max < max_size & i * m < max_size) {
jump_i <- jump_i + 1
next()
}
comb_col <- combinations(n,i)
for (k in 1:nrow(comb_col)){
col <- as.integer(comb_col[k,])
for(j in m:1){ # rows
if (i*j < max_size ) {
jump_j <- jump_j +1
next()
}
comb_row <- combinations(m,j)
for (l in 1:nrow(comb_row)){
row <- as.integer(comb_row[l,])
y <- x[row, col, drop=F]
if(all(y > 0) & max_size <= length(row)*length(col)){
if(max_size == length(row)*length(col)){
if(sum(y) > max_sum){
max_size <- length(row) * length(col)
max_cols <- col
max_rows <- row
max_sum <- sum(y)}
} else {
max_size <- length(row) * length(col)
max_cols <- col
max_rows <- row
max_sum <- sum(y)
}
}
}
}
}
}
return(x[max_rows,max_cols, drop=F])
}
hope this will work for you, any question please email me.
Upvotes: 2