问题描述:

Using PostgreSQL, I have a very large table of ~250 million rows composed of the following 4 attributes:

CREATE TABLE public.sim_values_english

(

id bigint NOT NULL DEFAULT nextval('sim_values_english_id_seq'::regclass),

word1 character varying(255),

word2 character varying(255),

sim_value double precision,

CONSTRAINT pk_sim_values_english PRIMARY KEY(id)

)

I am trying to select the sim_value attribute based on the defined word1 and word2 using the following SELECT query:

(Assume I have 2 words: X and Y in which X or Y can belong to either word1 or word2)

SELECT sim_value

From public.sim_values_english

Where (word1='X' or word2='X') and (word1='Y' or word2='Y') and (word1!=word2)

This is taking on average 1~1.5 minutes to return the sim_value which is really long! Storing the entire table in memory is very costly since it weights more than 10 GB).

How can I speed up this query? What are your suggestions?

P.S.: word1 and word2 are never identical so if for 1 case: word1 is X and word2 is Y, Then there does not exist another row where word1 is Y and word2 is X!

NOTE: I have searched for similar topics but none of them addressed this exact issue. Thank you for understanding

Thank you

网友答案:

Firstly, if you don't have it already, I would ensure that the following index exists:

CREATE INDEX ON sim_values_english(word1, word2);

Then I would try the following query:

SELECT sim_value
FROM sim_values_english
WHERE word1='X' AND word2='Y'
UNION ALL
SELECT sim_value
FROM sim_values_english
WHERE word1='Y' AND word2='X'
网友答案:

So you have to create index on these 2 fields

CREATE INDEX word1_word2_idx ON sim_values_english (word1, word2);

Then your query is very basic

SELECT sim_value
FROM sim_values_english
WHERE (word1='X' AND word2='Y') OR (word1='Y' AND word2='X')
网友答案:

Start by not storing the words, but only references to them, this will make the bridge-table leaner:

CREATE TABLE words_english
        ( word_id integer -- or: serial if you want
                NOT NULL PRIMARY KEY
        , word varchar UNIQUE
        );

CREATE TABLE sim_values_english
        ( word_id integer NOT NULL references words_english (word_id)
        , other_id integer NOT NULL references words_english (word_id)
        , sim_value DOUBLE PRECISION NOT NULL DEFAULT 0.0
        , PRIMARY KEY (word_id, other_id)
        );

CREATE UNIQUE INDEX ON sim_values_english(other_id,word_id);

Your query can now be rewitten to:

SELECT sim_value
FROM sim_values_english v
JOIN words_english one ON v.word_id = one.word_id
JOIN words_english two ON v.other_id = two.word_id
WHERE one.word IN ('X' ,'Y')
AND two.word IN ( 'X', 'Y' )
AND v.word_id <> v.other_id
        ;

(or use a view to emulate the old table)

Check this older answer for a recipe for squeezing out fat, repetitive columns out of a table.

相关阅读:
Top