Reputation: 21
I have an address database with 1 million rows. Users will enter free-form address text (no fixed structure, and spelling mistakes are acceptable). I need to split each address into sections such as region, city, town, village, and so on. I have almost solved this with the trigram algorithm, but it is very slow. My question is: how can I optimize this query? Currently I have:
```sql
SELECT *
FROM adresses_1
ORDER BY SIMILARITY(CONCAT(region, district, city, town, area, street, building), **address_text**) DESC
LIMIT 1;
```
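For context, the query relies on the pg_trgm extension, and `EXPLAIN` confirms why it is slow: a minimal sketch, assuming PostgreSQL and a sample input string.

```sql
-- pg_trgm provides similarity() and the trigram % / <-> operators
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- With no usable index, the plan is a sequential scan: similarity()
-- is computed for every one of the ~1 million rows, then sorted.
EXPLAIN
SELECT *
FROM adresses_1
ORDER BY SIMILARITY(CONCAT(region, district, city, town, area, street, building),
                    'sample user input') DESC
LIMIT 1;
```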
Upvotes: 2
Views: 318
Reputation: 43
You could run the addresses users enter through an address standardization API (such as SmartyStreets) to validate each address and pick out the components you want, storing them in discrete fields. This will make future retrieval, filtering, proximity searching, etc. very accurate. I have used SmartyStreets on millions of records in the past.
Upvotes: 2
Reputation: 44343
Your expression as written is not indexable. If you build a GiST trigram index on the expression CONCAT(region, district, city, town, area, street, building), then you could use:
```sql
ORDER BY CONCAT(region, district, city, town, area, street, building) <-> **address_text** ASC
LIMIT 1
```
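One wrinkle worth noting: `concat()` is declared STABLE, not IMMUTABLE, so PostgreSQL will reject it inside an index expression. A sketch of the GiST approach using the immutable `||` operator instead (with COALESCE for nullable columns, since `||` with NULL yields NULL while CONCAT ignores NULLs); the index name is hypothetical, and the ORDER BY must repeat the indexed expression exactly for the planner to use the index:

```sql
CREATE INDEX adresses_1_addr_trgm_gist  -- hypothetical name
    ON adresses_1
    USING gist ((COALESCE(region, '') || COALESCE(district, '') || COALESCE(city, '')
              || COALESCE(town, '')   || COALESCE(area, '')     || COALESCE(street, '')
              || COALESCE(building, '')) gist_trgm_ops);

-- <-> is the trigram distance operator; GiST can return rows in
-- increasing distance order, so LIMIT 1 stops after the best match.
SELECT *
FROM adresses_1
ORDER BY (COALESCE(region, '') || COALESCE(district, '') || COALESCE(city, '')
       || COALESCE(town, '')   || COALESCE(area, '')     || COALESCE(street, '')
       || COALESCE(building, '')) <-> 'user supplied address text'
LIMIT 1;
```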
Or if you build a GIN trigram index instead, the ORDER BY wouldn't be directly indexable; but you could use the index to efficiently filter out anything "obviously" not close, then sort the remaining rows:
```sql
WHERE CONCAT(region, district, city, town, area, street, building) % **address_text**
ORDER BY SIMILARITY(CONCAT(region, district, city, town, area, street, building), **address_text**) DESC
LIMIT 1
```
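The GIN variant can be sketched the same way (again using `||` because `concat()` is not IMMUTABLE, and with a hypothetical index name). The cutoff used by the `%` operator is controlled by `pg_trgm.similarity_threshold` (default 0.3); raising it shrinks the candidate set that the final sort has to handle:

```sql
CREATE INDEX adresses_1_addr_trgm_gin  -- hypothetical name
    ON adresses_1
    USING gin ((COALESCE(region, '') || COALESCE(district, '') || COALESCE(city, '')
             || COALESCE(town, '')   || COALESCE(area, '')     || COALESCE(street, '')
             || COALESCE(building, '')) gin_trgm_ops);

-- Optionally tighten the % cutoff for this session (default is 0.3)
SET pg_trgm.similarity_threshold = 0.5;

-- The GIN index answers the % filter; only the surviving rows are
-- scored with similarity() and sorted.
SELECT *
FROM adresses_1
WHERE (COALESCE(region, '') || COALESCE(district, '') || COALESCE(city, '')
    || COALESCE(town, '')   || COALESCE(area, '')     || COALESCE(street, '')
    || COALESCE(building, '')) % 'user supplied address text'
ORDER BY similarity((COALESCE(region, '') || COALESCE(district, '') || COALESCE(city, '')
                  || COALESCE(town, '')   || COALESCE(area, '')     || COALESCE(street, '')
                  || COALESCE(building, '')), 'user supplied address text') DESC
LIMIT 1;
```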
Or you could do as Jake proposes, and use software specially written for standardizing addresses.
Upvotes: 0