Reputation: 323
I have to sum a huge number of data with aggregation and where clause, using this query what I am doing is like this : I have three tables one contains terms the second contains user terms , and the third contains correlation factor between term and user term. I want to calculate the similarity between the sentence that that user inserted with an already existing sentences, and take the results greater than .5 by summing the correlation factor between sentences' terms The problem is that this query takes more than 15 min. because I have huge tables any suggestions to improve performance please?
insert into PLAG_SENTENCE_SIMILARITY
SELECT plag_TERMS.SENTENCE_ID ,plag_User_TERMS.SENTENCE_ID,
least( sum( plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_terms.sentence_length,
sum (plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_user_terms.sentence_length),
plag_TERMs.isn,
plag_user_terms.isn
FROM plag_TERM_CORRELATIONS3,
plag_TERMS,
Plag_User_TERMS
WHERE ( Plag_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM1
AND Plag_User_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM2
AND Plag_User_Terms.ISN=123)
having
least( sum( plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_terms.sentence_length,
sum (plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_user_terms.sentence_length) >0.5
group by (plag_User_TERMS.SENTENCE_ID,plag_TERMS.SENTENCE_ID , plag_TERMs.isn, plag_terms.sentence_length,plag_user_terms.sentence_length, plag_user_terms.isn);
plag_terms contains more than 50 million records and plag_correlations3 contains 500000
Upvotes: 0
Views: 2024
Reputation: 4141
If you have a sufficient amount of free disk space, then create a materialized view
The query rewrite is optional. If you can modify the above insert-select
, then you can just select from the materialized view instead of selecting from the join of the three tables.
As for the physical organization, consider
Plag_User_Terms.ISN
(with a sufficiently high number of partitions; don't hesitate to partition your table with e.g. 1024 partitions, if it seems reasonable) if you want to do a bulk calculation over all values of ISN
Plag_User_Terms.ISN
if you want to retain your calculation over a single ISN
If you don't have a spare disk space, then just hint your query to
plag_correlations3
table in parallelBottom line: Constrain your tables with foreign keys, check constraints, not-null constraints, unique constraints, everything! Because Oracle optimizer is capable of using most of these informations to its advantage, as are the people who tune SQL queries.
Upvotes: 1