Reputation: 171
Using PostgreSQL, I have a very large table of ~250 million rows composed of the following 4 attributes:
CREATE TABLE public.sim_values_english
(
id bigint NOT NULL DEFAULT nextval('sim_values_english_id_seq'::regclass),
word1 character varying(255),
word2 character varying(255),
sim_value double precision,
CONSTRAINT pk_sim_values_english PRIMARY KEY(id)
)
I am trying to select the sim_value
attribute based on the defined word1
and word2
using the following SELECT query:
(Assume I have 2 words: X and Y in which X or Y can belong to either word1 or word2)
SELECT sim_value
From public.sim_values_english
Where (word1='X' or word2='X') and (word1='Y' or word2='Y') and (word1!=word2)
This is taking on average 1~1.5 minutes to return the sim_value
which is really long! Storing the entire table in memory is very costly since it weights more than 10 GB).
How can I speed up this query? What are your suggestions?
P.S.: word1
and word2
are never identical so if for 1 case: word1
is X and word2
is Y, Then there does not exist another row where word1
is Y and word2
is X!
NOTE: I have searched for similar topics but none of them addressed this exact issue. Thank you for understanding
Thank you
Upvotes: 1
Views: 740
Reputation: 44220
Start by not storing the words, but only references to them, this will make the bridge-table leaner:
CREATE TABLE words_english
( word_id integer -- or: serial if you want
NOT NULL PRIMARY KEY
, word varchar UNIQUE
);
CREATE TABLE sim_values_english
( word_id integer NOT NULL references words_english (word_id)
, other_id integer NOT NULL references words_english (word_id)
, sim_value DOUBLE PRECISION NOT NULL DEFAULT 0.0
, PRIMARY KEY (word_id, other_id)
);
CREATE UNIQUE INDEX ON sim_values_english(other_id,word_id);
Your query can now be rewitten to:
SELECT sim_value
FROM sim_values_english v
JOIN words_english one ON v.word_id = one.word_id
JOIN words_english two ON v.other_id = two.word_id
WHERE one.word IN ('X' ,'Y')
AND two.word IN ( 'X', 'Y' )
AND v.word_id <> v.other_id
;
(or use a view to emulate the old table)
Check this older answer for a recipe for squeezing out fat, repetitive columns out of a table.
Upvotes: 0
Reputation: 1567
So you have to create index on these 2 fields
CREATE INDEX word1_word2_idx ON sim_values_english (word1, word2);
Then your query is very basic
SELECT sim_value
FROM sim_values_english
WHERE (word1='X' AND word2='Y') OR (word1='Y' AND word2='X')
Upvotes: 2
Reputation: 23840
Firstly, if you don't have it already, I would ensure that the following index exists:
CREATE INDEX ON sim_values_english(word1, word2);
Then I would try the following query:
SELECT sim_value
FROM sim_values_english
WHERE word1='X' AND word2='Y'
UNION ALL
SELECT sim_value
FROM sim_values_english
WHERE word1='Y' AND word2='X'
Upvotes: 4