Reputation: 845
I have a large set of combinations of tables used in data warehouse queries.
Example:
QueryID Table
-------- ------
1 t1
1 t2
1 t3
--------------
2 t1
2 t3
--------------
3 t4
3 t2
--------------
4 t2
4 t3
4 t4
--------------
5 t3
5 t1
...and more (there are thousands of such combinations). My current analysis involves finding patterns in this data, such as which tables are most frequently used together. My next goal is to find the maximum number of queries that can be run using the minimum number of tables, while also taking the tables' sizes into consideration.
For example, from the above data, several table combinations can each run two queries: (t1,t2,t3) covers queries 1 and 2, (t1,t3) covers queries 2 and 5, (t2,t3,t4) covers queries 3 and 4, and so on. Now suppose the sizes of the tables are
table size
----- -----
t1 20 GB
t2 40 GB
t3 10 GB
t4 50 GB
Then, out of these, (t1,t3) is the best combination: it has the minimum size and table count and can still run two queries. I am trying multiple approaches in SQL, Excel, and R to come up with a dynamic solution that can take parameters such as the number of queries you want to run and the maximum combined table size you are willing to tolerate. Any approaches or suggestions would be appreciated.
Update:
A query needs all of its participating tables to be available in order to run. So we cannot say that t1 alone satisfies the requirement of two queries, or that t2 alone satisfies the requirements of three queries.
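In other words, a table set covers a query only if it contains every table that query uses. A one-line check (a Python sketch; the function name is illustrative):

```python
def covers(table_set, query_tables):
    """A query can run only if ALL of its tables are in the set."""
    return query_tables <= table_set  # subset test

# queries 2 and 5 both use only t1 and t3, so {t1, t3} covers them
print(covers({"t1", "t3"}, {"t1", "t3"}))        # True  (query 2)
print(covers({"t1", "t3"}, {"t1", "t2", "t3"}))  # False (query 1 also needs t2)
```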
Upvotes: 2
Views: 266
Reputation: 6246
You could brute-force over all combinations of tables within the admissible size and keep the one with the maximum number of queries covered. This enumerates every subset, hence O(2^N * N). If brute force is not feasible, then I am afraid your problem is at least as hard as the knapsack problem, and there is no known polynomial-time solution for it. If a good (rather than provably optimal) solution will do, a greedy knapsack heuristic or a genetic algorithm can produce a feasible solution.
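A minimal sketch of this brute force in Python, using the example data from the question (the `max_gb` budget parameter and all names are illustrative):

```python
from itertools import combinations

# query -> set of tables it needs (from the question's example data)
queries = {1: {"t1", "t2", "t3"}, 2: {"t1", "t3"},
           3: {"t4", "t2"}, 4: {"t2", "t3", "t4"}, 5: {"t3", "t1"}}
sizes = {"t1": 20, "t2": 40, "t3": 10, "t4": 50}

def best_subset(max_gb):
    """Enumerate every subset of tables within the size budget and
    return the one covering the most queries (smallest size on ties)."""
    tables = sorted(sizes)
    best = (0, 0, ())  # (queries covered, -size, subset) -- maximized
    for r in range(1, len(tables) + 1):
        for combo in combinations(tables, r):
            gb = sum(sizes[t] for t in combo)
            if gb > max_gb:
                continue
            # a query counts only if ALL its tables are in the subset
            covered = sum(needed <= set(combo) for needed in queries.values())
            if (covered, -gb) > (best[0], best[1]):
                best = (covered, -gb, combo)
    return best[2], best[0], -best[1]

print(best_subset(30))   # -> (('t1', 't3'), 2, 30)
print(best_subset(100))  # -> (('t1', 't2', 't3'), 3, 70)
```

For thousands of queries this check is cheap, but the 2^N subset loop limits it to a few dozen tables; beyond that a heuristic is needed, as noted above.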
Upvotes: 0
Reputation: 12905
If I understood correctly, you want the minimum total table size needed to run x number of queries:
library(data.table)
# datasets borrowed from @androboy's answer (special characters removed so the code formatting works)
quer <- structure(list(QueryID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
Table = c("t1", "t2", "t3", "t1", "t3", "t4", "t2", "t2", "t3", "t4")),
.Names = c("QueryID", "Table"), class = "data.frame", row.names = c(NA, -10L))
size <- structure(list(Table = c("t1", "t2", "t3", "t4"),
GB = c(20L, 40L, 10L, 50L)), .Names = c("Table", "GB"),
class = "data.frame", row.names = c(NA, -4L))
quer <- data.table(quer)
size <- data.table(size)
# number of queries to run
queries <- 2
# creating unique combinations of two queries each-------------------
querylist <- vector(mode="list",length=queries)
for (i in seq(queries)) {
  querylist[[i]] <- unique(quer$QueryID)
}
qdf <- expand.grid(querylist)
# removing rows with same query counted twice
if (queries > 1) {
  test <- apply(
    t(combn(seq(queries), 2)),
    1,
    function(x) qdf[, x[1]] != qdf[, x[2]]
  )
  qdf <- qdf[rowSums(test) == (queries - 1), ]
}
qdf <- data.table(qdf)
# checking tablesizes needed to run this combination-------------------
qdf[,tablesneeded := '']
qdf[,sizeneeded := as.integer(NA)]
setkeyv(quer,'QueryID')
setkeyv(size,'Table')
for (i in seq(nrow(qdf))) {
  vars   <- grep(colnames(qdf), pattern = "Var", value = TRUE)
  Tables <- quer[data.table(V1 = unlist(qdf[i, vars, with = FALSE]))[, keyby = 'V1']][, unique(Table)]
  qdf[i, tablesneeded := paste(Tables, collapse = ',')]
  qdf[i, sizeneeded := as.integer(sum(size[data.table(V1 = Tables)[, keyby = 'V1']][, GB], na.rm = TRUE))]
}
# lowest size option for current number of queries-----------------------
qdf[which.min(sizeneeded)]
Upvotes: 1
Reputation: 4784
It's not clear to me how you are calculating the number of queries for a group of tables. I can see that t1 has two queries, t2 has three queries, and t3 has three queries. How do you calculate the number of queries from the combination of tables t1, t2, and t3?
# the data that you posted
quer <- structure(list(QueryID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
Table = c("t1", "t2", "t3", "t1", "t3", "t4", "t2", "t2", "t3", "t4")),
.Names = c("QueryID", "Table"), class = "data.frame", row.names = c(NA, -10L))
size <- structure(list(Table = c("t1", "t2", "t3", "t4"),
GB = c(20L, 40L, 10L, 50L)), .Names = c("Table", "GB"),
class = "data.frame", row.names = c(NA, -4L))
# perhaps a helpful way to reorganize your query data
both <- merge(quer, size)
size2 <- tapply(both$GB, list(both$QueryID, both$Table), mean)
size2
t1 t2 t3 t4
1 20 40 10 NA
2 20 NA 10 NA
3 NA 40 NA 50
4 NA 40 10 50
apply(size2, 1, sum, na.rm=TRUE)
1 2 3 4
70 30 90 100
Upvotes: 2