Reputation:
I have a big Table in my Oracle Database (12c) and I have a lot of Strings that need to be compared to each other. I want to see which Strings are similar.
Example:
Hello World Boston in America
Hello World Berlin in Germany
Should be detected as similiar Strings, while
Hello World Paris in France
Hello Detroit a nice city
should not be detected as similar. How can I do this in Oracle PL/SQL efficiently and accurately?
Upvotes: 3
Views: 1129
Reputation: 1032
You should use the Dice-Coefficient which has a nice balance between performance and efficiency. It will give you a NUMBER which represents how similar two strings are. You can run this Function for all of your strings comparing them to each other and then choose the ones which have a Big Coefficient (those will be very similar).
-- Dice-Coefficient in Oracle PL/SQL
-- http://en.wikipedia.org/wiki/Dice%27s_coefficient
-- http://en.wikipedia.org/wiki/S%C3%B8rensen_similarity_index
-------------------------------------------------------------------------------
FUNCTION DICE_COEFF(p_str1 IN VARCHAR2, p_str2 IN VARCHAR2) RETURN NUMBER DETERMINISTIC IS
co_substr_len CONSTANT NUMBER := 2;
TYPE ty_varchar_assoc IS TABLE OF VARCHAR2(100) INDEX BY VARCHAR2(100);
v_x ty_varchar_assoc;
v_y ty_varchar_assoc;
v_inter ty_varchar_assoc;
v_part VARCHAR2(10);
BEGIN
-- building set X
FOR i IN 1 .. length(p_str1) - co_substr_len + 1 LOOP
v_part := substr(p_str1, i, co_substr_len);
v_x(v_part) := v_part;
END LOOP;
-- building set Y
FOR i IN 1 .. length(p_str2) - co_substr_len + 1 LOOP
v_part := substr(p_str2, i, co_substr_len);
v_y(v_part) := v_part;
IF v_x.exists(v_part) THEN
v_inter(v_part) := v_part; -- build intersect
END IF;
END LOOP;
RETURN 2 * v_inter.count /(v_x.count + v_y.count);
END DICE_COEFF;
Upvotes: 3