h1fra

Reputation: 324

Postgres optimisation of multicolumn indexes with text column

I have a question about optimizing an index in Postgres. I didn't find much help online, and I have struggled to get the answer myself by testing.

I have this table

CREATE TABLE "public"."crawls" (
    "id" uuid NOT NULL DEFAULT uuid_generate_v4(),
    "parent_id" uuid,
    "group_id" timestamp,
    "url" varchar(2083) NOT NULL,
    "done" boolean;
    PRIMARY KEY ("id")
);
CREATE UNIQUE INDEX "parentid_groupid_url" ON "public"."urls" USING BTREE ("parent_id","group_id","url");

It's a URL store that is used to build a comprehensive list of URLs that are UNIQUE per parent and per group. I only need exact matches on this index. This means the same parent_id can have the same URL multiple times as long as the group_id is different.

The table contains hundreds of millions of URLs and is mainly used for writes; the UNIQUE index is for deduplication.

  UPDATE crawls
  SET
    done = TRUE
  WHERE
    url = $1 AND
    parent_id = $2 AND
    group_id = $3

INSERT INTO crawls (
  url,
  parent_id,
  group_id
) VALUES
  ('long urls', uuid, date)
ON CONFLICT (parent_id, group_id, url) DO NOTHING

Currently the performance is okay but could be better, and the index is larger than the table itself because of the url column.

I was wondering how I could improve the size and/or the performance (both if possible)?

I thought about adding a new column holding a hash (md5, sha1) of the URL and using it in the index instead of the URL, so that the length is consistent, smaller, and maybe faster for Postgres, but I didn't find any help on that. I'm not sure it's efficient because of the "randomness" of a hash, and I have a hard time testing this hypothesis due to the size of the table and the time it takes to build the index on my prod.
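For concreteness, this is roughly what I had in mind (just a sketch, not tested; the column name url_md5 and the bytea representation are made up for illustration):

-- hypothetical hashed-column variant of the schema above
ALTER TABLE "public"."crawls" ADD COLUMN "url_md5" bytea;

-- backfill; the application (or a trigger / generated column) would have to keep it in sync on insert
UPDATE "public"."crawls" SET "url_md5" = decode(md5(url), 'hex');

CREATE UNIQUE INDEX "parentid_groupid_urlmd5" ON "public"."crawls" USING BTREE ("parent_id","group_id","url_md5");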

Thanks,

Upvotes: 0

Views: 266

Answers (1)

jjanes

Reputation: 44167

I thought about adding a new column holding a hash (md5, sha1) of the URL and using it in the index instead of the URL, so that the length is consistent, smaller, and maybe faster for Postgres

create unique index on crawls (parent_id, group_id, md5(url));

This will automatically enforce uniqueness (and will also prohibit md5 collisions between URLs that are actually distinct, but in the absence of malice the chances of that occurring are tiny). However, it will not automatically be used for fast look-ups; you have to adapt your queries to allow it to be used:

WHERE
  md5(url) = md5($1) AND
  parent_id = $2 AND
  group_id = $3
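
Applied to the statements from the question, that would look roughly like this (a sketch; it assumes the unique index above, and the conflict target has to match the index expression exactly):

-- update by hash instead of by raw url
UPDATE crawls
SET
  done = TRUE
WHERE
  md5(url) = md5($1) AND
  parent_id = $2 AND
  group_id = $3;

-- deduplicating insert against the expression index
INSERT INTO crawls (
  url,
  parent_id,
  group_id
) VALUES
  ($1, $2, $3)
ON CONFLICT (parent_id, group_id, md5(url)) DO NOTHING;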

You could save more space by using a representation shorter than hex:

create index on crawls (parent_id,group_id,decode(md5(url),'hex'));

But that will make it even more cumbersome to use.
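For instance, the look-up then has to spell out the same expression (again just a sketch):

WHERE
  decode(md5(url), 'hex') = decode(md5($1), 'hex') AND
  parent_id = $2 AND
  group_id = $3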

I'm not sure it's efficient because of the "randomness" of a hash

It depends entirely on your usage pattern and your data distribution. If you commonly access a series of records with the same parent_id and group_id and adjacent urls, and the number of records with the same parent_id and group_id is large, then hashing the urls could decrease the effectiveness of caching.

I have a hard time testing this hypothesis due to the size of the table and the time it takes to build the index on my prod.

Not having a test environment is like working with both hands tied behind your back.

Upvotes: 1
