patrickjp93
patrickjp93

Reputation: 419

SQL Loop Multi Substring Replace

I need to build a function to replace multiple substrings for all rows in a table.

Performance is not a big concern, as this is a one-time operation, but there are 48 mappings and roughly 30,000 rows. I know looping over the whole database 48 times is quite stupid, but SQL is not my wheelhouse. If this were Java or C++, it'd be cake.

Basically, I need the SQL analog of the following function. If SQL can't short-circuit loops, that's fine. I've seen the SQL replace function, but encapsulating it properly in a user-defined function is my major stumbling block.

I'm using Microsoft SQL Server if that produces any particular quirks.

mapping[] maps = { {" st ", " Street "}, {" st. ", " Street "}, ...};

for(row r : table) {
    String orig = r.data(colName);
    for(mapping m : maps) {
        r.data(colName).replace(m.first, m.second);
        if(r.data(colName) != orig)
            break;
    }
}

Upvotes: 1

Views: 4500

Answers (2)

CREATE FUNCTION [dbo].[StandardizeAddress](@address varchar(123))
RETURNS varchar(250)
WITH SCHEMABINDING
AS
BEGIN
    RETURN
        REPLACE(REPLACE(
                @address + ' '
                , ' st ', ' Street')
                , ' st. ', ' Street ')
END

Creating a scalar function like this is how we did this. Using the code above to compute addresses from a table of 171,000 rows took 240 ms. Using our actual function, which has more than 80 substitutions, and does some other manipulations takes 5 seconds for 171,000 rows. However, we store the standardized version of addresses, because we are doing complex person searches and precompute standardized values for performance sake. So the function is only run once when a row is added or an address modified, so the speed of this function is not an issue.

For comparison, Gordon's solution takes 4.5 seconds against the same dataset (vs. 240 ms for chained REPLACE). With 4 replacements instead of two, the CTE solution takes 7.8 seconds, vs. 275 ms for REPLACE.

I need to add a caveat that there is a limit to how many function calls can be nested. According to another question on stackOverflow, the limit is 244, which is a fair amount larger than the default max recursion limit for recursive CTEs.

Another option that's a bit slower (about 75% more time) than nested REPLACE functions is something like the following:

select c3.address from (select REPLACE(@address, ' st ', ' Street ') address) c1
        cross apply (select REPLACE(c1.address, ' st. ', ' Street ') address) c2
        cross apply (select REPLACE(c2.address, ' dr ', ' Drive ') address) c3

But I don't see much advantage for that. You could also write a c# CLR function, but I suspect that the calling overhead might make it slower than just using the nested REPLACE calls.

Edit- Since posting this I posted an answer to a similar question that seems to be in the speed ballpark of nested REPLACE, but is much cleaner (and more maintainable) code.

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269823

@Hogan has the right idea. This syntax should be closer to working:

WITH map as (
      SELECT v.*
      FROM (VALUES (' st ', ' Street ', 1),
                   (' st. ', ' Street ', 2)
           ) v(str, repstr, n)
     ),
     cte as (
      SELECT replace(t.field, map.str, map.repstr) as field, map.n as n
      FROM t JOIN
           map
           ON map.n = 1
      UNION ALL
      SELECT replace(cte.field, map.str, map.repstr) as field, map.n + 1
      FROM cte JOIN
           map
           ON map.n = cte.n + 1
     )
SELECT field 
FROM (SELECT cte.*, MAX(cte.n) OVER (PARTITION BY cte.field) as maxn
      FROM cte
     ) x
WHERE n = maxn;

You may want to include more fields in the CTE from the original table.

Upvotes: 2

Related Questions