String replacement in Postgresql originating an array of additional strings

Question

Suppose you have two tables with substitutions which MUST be kept as they are and another table containing a body of names. How could I get all the possible substitutions?

Substitution Table
--------------------------------------
word        subs_list
MOUNTAIN    MOUNTAIN, MOUNT, MT, MTN
HOUSE       HAUS, HOUSE
VIEW        VU, VIEW

Synonyms table
-------------------------------------------------
EDUCATION   SCHOOL, UNIVERSITY, COLLEGE, TRAINING
FOOD        STORE, FOOD, CAFE
STORE       FOOD, STORE, MARKET
REFRIGERATION   FOODLOCKER, FREEZE, FRIDGE

names table
------------------------------------------------
MOUNT VU FOOD USA 
MOUNTAIN VU STORE CA

Note: I know that it would be desirable to have just one substitution table, but both substution tables must remain because they served to additional purposes than the one explained above, those tables are already in used. In addition, the list of replacements in both tables are just a varchar with a string separated by commas

Considering the previous, the problem is to generate possible names derived by substitution. For instance, the name MOUNT VU FOOD USA should be decomposed to MOUNTAIN VIEW FOOD USA and MOUNTAIN VIEW STORE USA, the same fashion would apply for the second.

I have been able to get the replacements in a wrong order and all together in function, there is a way to get an array as output with the different names generated after replacement? So far I have created this function for replacement:

create or replace function replace_companies_array(i_sentence IN VARCHAR) returns VARCHAR[] AS $p_replaced$
DECLARE
  p_replaced VARCHAR[];
  subs RECORD;
  flag boolean:= True;
  cur_s CURSOR(i_sentence VARCHAR)
    FOR SELECT w.input, coalesce(x.word, w.input) as word, count(*) OVER (PARTITION BY w.input) as counter
    FROM regexp_split_to_table(trim(i_sentence), '\s') as w(input) 
     LEFT JOIN (
      select s.word, trim(s1.token) as token
      from subs01 s
       cross join unnest(string_to_array(s.subs_list, ',')) s1(token)
      union
      select sy.word, trim(s2.token) as token
      from syns01 sy
       cross join unnest(string_to_array(sy.syn_list, ',')) s2(token)
     ) as x on lower(trim(w.input)) = lower(x.token)
   order by counter;
BEGIN
  OPEN cur_s(i_sentence);

  LOOP
   --fetch row into the substitutions
     FETCH cur_s INTO subs;

   --Exit when no more rows to fetch
     EXIT WHEN NOT FOUND;

     SELECT REGEXP_REPLACE(i_sentence,'(^|[^a-z0-9])' || subs.input || '($|[^a-z0-9])','\1' || UPPER(subs.word) || '\2','g')
     INTO i_sentence; 

  END LOOP;
  p_replaced:=array_append(p_replaced, i_sentence);

  RETURN p_replaced;

END;
$p_replaced$ LANGUAGE plpgsql;

Thank you so much for your contributions

Thomas B · Accepted Answer

I didn't manage to get the final result, but I'w quite close to it!

From sentence: MOUNT VU FOOD USA, I obtain {"MOUNTAIN VIEW MARKET USA","MOUNTAIN VIEW STORE USA","MOUNTAIN VIEW CAFE USA","MOUNTAIN VIEW FOOD USA"}

Here are all my script to recreate the synonyms & substitute tables:

DROP TABLE IF EXISTS subs01;
DROP TABLE IF EXISTS syns01;
CREATE TABLE subs01 (word VARCHAR(20), subs_list VARCHAR(200));
CREATE TABLE syns01 (word VARCHAR(20), syn_list VARCHAR(200));

INSERT INTO subs01 (word, subs_list) VALUES ('MOUNTAIN', 'MOUNTAIN, MOUNT, MT, MTN'),('HOUSE', 'HAUS, HOUSE'),('VIEW', 'VU, VIEW');
INSERT INTO syns01 (word, syn_list) VALUES ('EDUCATION', 'SCHOOL, UNIVERSITY, COLLEGE, TRAINING'),('FOOD', 'STORE, FOOD, CAFE'),('STORE', 'FOOD, STORE, MARKET'),('REFRIGERATION', 'FOODLOCKER, FREEZE, FRIDGE');

I decided to split the job into 2 phases:

Substitute the words:

CREATE OR REPLACE function substitute_words (i_sentence IN VARCHAR) returns VARCHAR AS $p_substituted$
DECLARE
  --p_substituted VARCHAR;
  subs_cursor CURSOR FOR select su.word, trim(s2.token) as token from subs01 su cross join unnest(string_to_array(su.subs_list, ',')) s2(token);
  subs_record record;
BEGIN
  OPEN subs_cursor;
  LOOP
    FETCH subs_cursor INTO subs_record;
    EXIT WHEN NOT FOUND;
    RAISE NOTICE 'INFO : TOKEN (%) ',subs_record.token ;

    IF i_sentence LIKE '%'|| subs_record.token || '%' THEN
      RAISE NOTICE '-- FOUND : TOKEN (%) ',subs_record.token ;
      SELECT replace (i_sentence, subs_record.token, subs_record.word) INTO i_sentence;
    END IF;
  END LOOP;
  CLOSE subs_cursor;
  RETURN i_sentence;
END
$p_substituted$ LANGUAGE plpgsql;

Replace known words by their synomyms:

CREATE OR REPLACE function synonymize_sentence (i_sentence IN VARCHAR) returns TABLE (sentence_result VARCHAR) AS $p_syn$
DECLARE
  syn_cursor CURSOR FOR select su.word, trim(s2.token) as token from syns01 su cross join unnest(string_to_array(su.syn_list, ',')) s2(token);
  syn_record record;
BEGIN
  CREATE TEMPORARY TABLE record_syn (result VARCHAR(200)) ON COMMIT DROP;
  INSERT INTO record_syn (result) SELECT i_sentence;
  OPEN syn_cursor;
  LOOP
    FETCH syn_cursor INTO syn_record;
    EXIT WHEN NOT FOUND;
    RAISE NOTICE 'INFO : WORD (%) ',syn_record.word ;

    INSERT INTO record_syn (result) SELECT replace (result, syn_record.word, syn_record.token) FROM record_syn where result LIKE '%'||     syn_record.word || '%';

  END LOOP;
  CLOSE syn_cursor;
  RETURN QUERY SELECT distinct result FROM record_syn;
END;
$p_syn$ LANGUAGE plpgsql;

Then, to generate the result array, I perform this statement:

SELECT ARRAY(SELECT synonymize_sentence (substitute_words ('MOUNT VU FOOD USA')));

String replacement in Postgresql originating an array of additional strings

Answers (1)

Related Questions