Reputation: 845
I have a large set of combinations of tables used in data warehouse queries.
Example:
QueryID Table
-------- ------
1 t1
1 t2
1 t3
--------------
2 t1
2 t3
--------------
3 t4
3 t2
--------------
4 t2
4 t3
4 t4
--------------
5 t3
5 t1
...and more (there are thousands of such combinations). My current analysis involves finding patterns in this data, such as which tables are most frequently used together. My next goal is to find the maximum number of queries that can be run using the minimum number of tables, while also taking the tables' sizes into consideration.
For example, from the above data, several table combinations can each run two queries: (t1,t2,t3) covers queries 1 and 2, (t1,t3) covers queries 2 and 5, (t2,t3,t4) covers queries 3 and 4, and so on. Now suppose the sizes of the tables are
table size
----- -----
t1 20 GB
t2 40 GB
t3 10 GB
t4 50 GB
Then, out of these, (t1,t3) is the best combination: it has the minimum size and table count and can still run two queries. I am trying multiple approaches in SQL, Excel, and R to come up with a dynamic solution that can take parameters such as the number of queries you want to run and the maximum combined table size you are willing to tolerate. Any approaches or suggestions would be appreciated.
Update:
A query needs all of its participating tables to be available in order to run. So we cannot say that t1 alone satisfies the requirement of two queries, or that t2 alone satisfies the requirements of three queries.
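In other words, a table set covers a query only if it contains every table that query uses. A one-line check (a Python sketch; the function name is illustrative):

```python
def covers(table_set, query_tables):
    """A query can run only if ALL of its tables are in the set."""
    return query_tables <= table_set  # subset test

# queries 2 and 5 both use only t1 and t3, so {t1, t3} covers them
print(covers({"t1", "t3"}, {"t1", "t3"}))        # True  (query 2)
print(covers({"t1", "t3"}, {"t1", "t2", "t3"}))  # False (query 1 also needs t2)
```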
Upvotes: 2
Views: 266
Reputation: 6246
You could brute-force over all combinations of tables within the admissible size and keep the one with the maximum number of queries covered. This enumerates every subset, hence O(2^N * N). If brute force is not feasible, then I am afraid your problem is at least as hard as the knapsack problem, and there is no known polynomial-time solution for it. If a good (rather than provably optimal) solution will do, a greedy knapsack heuristic or a genetic algorithm can produce a feasible solution.
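A minimal sketch of this brute force in Python, using the example data from the question (the `max_gb` budget parameter and all names are illustrative):

```python
from itertools import combinations

# query -> set of tables it needs (from the question's example data)
queries = {1: {"t1", "t2", "t3"}, 2: {"t1", "t3"},
           3: {"t4", "t2"}, 4: {"t2", "t3", "t4"}, 5: {"t3", "t1"}}
sizes = {"t1": 20, "t2": 40, "t3": 10, "t4": 50}

def best_subset(max_gb):
    """Enumerate every subset of tables within the size budget and
    return the one covering the most queries (smallest size on ties)."""
    tables = sorted(sizes)
    best = (0, 0, ())  # (queries covered, -size, subset) -- maximized
    for r in range(1, len(tables) + 1):
        for combo in combinations(tables, r):
            gb = sum(sizes[t] for t in combo)
            if gb > max_gb:
                continue
            # a query counts only if ALL its tables are in the subset
            covered = sum(needed <= set(combo) for needed in queries.values())
            if (covered, -gb) > (best[0], best[1]):
                best = (covered, -gb, combo)
    return best[2], best[0], -best[1]

print(best_subset(30))   # -> (('t1', 't3'), 2, 30)
print(best_subset(100))  # -> (('t1', 't2', 't3'), 3, 70)
```

For thousands of queries this check is cheap, but the 2^N subset loop limits it to a few dozen tables; beyond that a heuristic is needed, as noted above.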
Upvotes: 0
Reputation: 12905
If I understood correctly, you want the minimum total table size needed to run x number of queries:
library(data.table)
# datasets borrowed from @androboy's answer (special characters removed so the code formatting works)
quer <- structure(list(QueryID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
Table = c("t1", "t2", "t3", "t1", "t3", "t4", "t2", "t2", "t3", "t4")),
.Names = c("QueryID", "Table"), class = "data.frame", row.names = c(NA, -10L))
size <- structure(list(Table = c("t1", "t2", "t3", "t4"),
GB = c(20L, 40L, 10L, 50L)), .Names = c("Table", "GB"),
class = "data.frame", row.names = c(NA, -4L))
quer <- data.table(quer)
size <- data.table(size)
# number of queries to run
queries <- 2
# creating unique combinations of two queries each-------------------
querylist <- vector(mode="list",length=queries)
for (i in seq(queries)) {
  querylist[[i]] <- unique(quer$QueryID)
}
qdf <- expand.grid(querylist)
# removing rows with same query counted twice
if (queries > 1) {
  test <- apply(
    t(combn(seq(queries), 2)),
    1,
    function(x) qdf[, x[1]] != qdf[, x[2]]
  )
  qdf <- qdf[rowSums(test) == (queries - 1), ]
}
qdf <- data.table(qdf)
# checking tablesizes needed to run this combination-------------------
qdf[,tablesneeded := '']
qdf[,sizeneeded := as.integer(NA)]
setkeyv(quer,'QueryID')
setkeyv(size,'Table')
for (i in seq(nrow(qdf))) {
  vars   <- grep(colnames(qdf), pattern = "Var", value = TRUE)
  Tables <- quer[data.table(V1 = unlist(qdf[i, vars, with = FALSE]))[, keyby = 'V1']][, unique(Table)]
  qdf[i, tablesneeded := paste(Tables, collapse = ',')]
  qdf[i, sizeneeded := as.integer(sum(size[data.table(V1 = Tables)[, keyby = 'V1']][, GB], na.rm = TRUE))]
}
# lowest size option for current number of queries-----------------------
qdf[which.min(sizeneeded)]
Upvotes: 1
Reputation: 4784
It's not clear to me how you are calculating the number of queries for a group of tables. I can see that t1 has two queries, t2 has three queries, and t3 has three queries. How do you calculate the number of queries from the combination of tables t1, t2, and t3?
# the data that you posted
quer <- structure(list(QueryID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
Table = c("t1", "t2", "t3", "t1", "t3", "t4", "t2", "t2", "t3", "t4")),
.Names = c("QueryID", "Table"), class = "data.frame", row.names = c(NA, -10L))
size <- structure(list(Table = c("t1", "t2", "t3", "t4"),
GB = c(20L, 40L, 10L, 50L)), .Names = c("Table", "GB"),
class = "data.frame", row.names = c(NA, -4L))
# perhaps a helpful way to reorganize your query data
both <- merge(quer, size)
size2 <- tapply(both$GB, list(both$QueryID, both$Table), mean)
size2
t1 t2 t3 t4
1 20 40 10 NA
2 20 NA 10 NA
3 NA 40 NA 50
4 NA 40 10 50
apply(size2, 1, sum, na.rm=TRUE)
1 2 3 4
70 30 90 100
Upvotes: 2