Andrea Schmuttermair
Andrea Schmuttermair

Reputation: 670

MySQL replace .html links of given domain

In my database there are contenfields with lots of internal links. I have to change link structure from www.mydomain.de/page.html to www.mydomain.de/page/, but replace statement should respect the domain:

This is whats expected to be replaced:

www.mydomain.de/somepage.html -> www.mydomain.de/page/
www.mydomain.de/subfolder/page.html -> www.mydomain.de/subfolder/page/
www.mydomain.de/link.html?param=1 -> www.mydomain.de/page/?param=1
www.mydomain.de/another-link.html#hash -> www.mydomain.de/page/#hash

All other links should be untouched, here some examples, but could be any link on the web:

 www.some-domain.de/link.html    
 www.another-domain.com/somelink.html

There can be different links in one contentfield:

<p>If you want to read more, click 
<a href="http://www.mydomain.de/page.html">here</a> 
or there <a href="http://www.another-domain.com/somelink.html">there</a>

This is doing the replace:

UPDATE tablename 
SET contentfield = REPLACE(contentfield, '.html', '/') 

My ideas (but don't know how to create a statement for them):

It does not have to be 100% matching all 'mydomain.de' links, I am happy with 90%, but there should not be a wrong replacement in external links.

Upvotes: 3

Views: 1327

Answers (4)

Steve Chambers
Steve Chambers

Reputation: 39404

UPDATE: Have now made this into a blog post: http://stevettt.blogspot.co.uk/2018/02/a-mysql-regular-expression-replace.html


Please see the following Rextester Fiddle, which I think should produce all the results you have asked for:

Rextester Demo

Explanation

A pattern-replacing function is needed for this but unfortunately MySQL doesn't provide such a thing. So I wrote one (based on another that wasn't quite sufficient) and have posted it here. As mentioned in the referenced answer, this function has a limitation of not allowing substitution of capturing groups with backreferences. Hence it has been slightly adapted in the fiddle to take further parameters that allow it to perform a recursive replace in the found match for the replacement. (Note the use of allowed URL path characters in the regex as per this excellent answer).

Update SQL

The following SQL will update the table data using the function below:

UPDATE urls
SET url = reg_replace(
  url,
  'www\\.mydomain\\.de/[-A-Za-z0-9\\._~!\\$&''\\(\\)\\*\\+,;=:@%/]+\\.html',
  '/[^/]+\\.html',
  '/page/',
  TRUE,
  22, -- Min match length = www.mydomain.de/?.html = 22
  0,  -- No max match length
  7,  -- Min sub-match length = /?.html = 7
  0   -- No max sub-match length
  );

Function code

The UDF code used in the demo is also posted below. Note: The UDF delegates to a stored procedure since only stored procedures will allow recursion in MySQL.

-- ------------------------------------------------------------------------------------
-- USAGE
-- ------------------------------------------------------------------------------------
-- SELECT reg_replace(<subject>,
--                    <pattern>,
--                    <subpattern>,
--                    <replacement>,
--                    <greedy>,
--                    <minMatchLen>,
--                    <maxMatchLen>,
--                    <minSubMatchLen>,
--                    <maxSubMatchLen>);
-- where:
-- <subject> is the string to look in for doing the replacements
-- <pattern> is the regular expression to match against
-- <subpattern> is a regular expression to match against within each
--              portion of text that matches <pattern>
-- <replacement> is the replacement string
-- <greedy> is TRUE for greedy matching or FALSE for non-greedy matching
-- <minMatchLen> specifies the minimum match length
-- <maxMatchLen> specifies the maximum match length
-- <minSubMatchLen> specifies the minimum match length
-- <maxSubMatchLen> specifies the maximum match length
-- (minMatchLen, maxMatchLen, minSubMatchLen and maxSubMatchLen are used to improve
--  efficiency but are optional and can be set to 0 or NULL if not known/required)
-- Example:
-- SELECT reg_replace(txt, '[A-Z0-9]{3}', '[0-9]', '_', TRUE, 3, 3, 1, 1) FROM tbl;
DROP FUNCTION IF EXISTS reg_replace;
DELIMITER //
CREATE FUNCTION reg_replace(subject VARCHAR(21845), pattern VARCHAR(21845),
  subpattern VARCHAR(21845), replacement VARCHAR(21845), greedy BOOLEAN,
  minMatchLen INT, maxMatchLen INT, minSubMatchLen INT, maxSubMatchLen INT)
RETURNS VARCHAR(21845) DETERMINISTIC BEGIN
  DECLARE result VARCHAR(21845);
  CALL reg_replace_worker(
    subject, pattern, subpattern, replacement, greedy, minMatchLen, maxMatchLen,
    minSubMatchLen, maxSubMatchLen, result);
  RETURN result;
END;//
DELIMITER ;

DROP PROCEDURE IF EXISTS reg_replace_worker;
DELIMITER //
CREATE PROCEDURE reg_replace_worker(subject VARCHAR(21845), pattern VARCHAR(21845),
  subpattern VARCHAR(21845), replacement VARCHAR(21845), greedy BOOLEAN,
  minMatchLen INT, maxMatchLen INT, minSubMatchLen INT, maxSubMatchLen INT,
  OUT result VARCHAR(21845))
BEGIN
  DECLARE subStr, usePattern, useRepl VARCHAR(21845);
  DECLARE startPos, prevStartPos, startInc, len, lenInc INT;
  SET @@SESSION.max_sp_recursion_depth = 2;
  IF subject REGEXP pattern THEN
    SET result = '';
    -- Sanitize input parameter values
    SET minMatchLen = IF(minMatchLen < 1, 1, minMatchLen);
    SET maxMatchLen = IF(maxMatchLen < 1 OR maxMatchLen > CHAR_LENGTH(subject),
                         CHAR_LENGTH(subject), maxMatchLen);
    -- Set the pattern to use to match an entire string rather than part of a string
    SET usePattern = IF (LEFT(pattern, 1) = '^', pattern, CONCAT('^', pattern));
    SET usePattern = IF (RIGHT(pattern, 1) = '$', usePattern, CONCAT(usePattern, '$'));
    -- Set start position to 1 if pattern starts with ^ or doesn't end with $.
    IF LEFT(pattern, 1) = '^' OR RIGHT(pattern, 1) <> '$' THEN
      SET startPos = 1, startInc = 1;
    -- Otherwise (i.e. pattern ends with $ but doesn't start with ^): Set start pos
    -- to the min or max match length from the end (depending on "greedy" flag).
    ELSEIF greedy THEN
      SET startPos = CHAR_LENGTH(subject) - maxMatchLen + 1, startInc = 1;
    ELSE
      SET startPos = CHAR_LENGTH(subject) - minMatchLen + 1, startInc = -1;
    END IF;
    WHILE startPos >= 1 AND startPos <= CHAR_LENGTH(subject)
      AND startPos + minMatchLen - 1 <= CHAR_LENGTH(subject)
      AND !(LEFT(pattern, 1) = '^' AND startPos <> 1)
      AND !(RIGHT(pattern, 1) = '$'
            AND startPos + maxMatchLen - 1 < CHAR_LENGTH(subject)) DO
      -- Set start length to maximum if matching greedily or pattern ends with $.
      -- Otherwise set starting length to the minimum match length.
      IF greedy OR RIGHT(pattern, 1) = '$' THEN
        SET len = LEAST(CHAR_LENGTH(subject) - startPos + 1, maxMatchLen), lenInc = -1;
      ELSE
        SET len = minMatchLen, lenInc = 1;
      END IF;
      SET prevStartPos = startPos;
      lenLoop: WHILE len >= 1 AND len <= maxMatchLen
                 AND startPos + len - 1 <= CHAR_LENGTH(subject)
                 AND !(RIGHT(pattern, 1) = '$' 
                       AND startPos + len - 1 <> CHAR_LENGTH(subject)) DO
        SET subStr = SUBSTRING(subject, startPos, len);
        IF subStr REGEXP usePattern THEN
          IF subpattern IS NULL THEN
            SET useRepl = replacement;
          ELSE
            CALL reg_replace_worker(subStr, subpattern, NULL, replacement, greedy, 
                          minSubMatchLen, maxSubMatchLen, NULL, NULL, useRepl);
          END IF;
          SET result = IF(startInc = 1,
                          CONCAT(result, useRepl), CONCAT(useRepl, result));
          SET startPos = startPos + startInc * len;
          LEAVE lenLoop;
        END IF;
        SET len = len + lenInc;
      END WHILE;
      IF (startPos = prevStartPos) THEN
        SET result = IF(startInc = 1, CONCAT(result, SUBSTRING(subject, startPos, 1)),
                        CONCAT(SUBSTRING(subject, startPos, 1), result));
        SET startPos = startPos + startInc;
      END IF;
    END WHILE;
    IF startInc = 1 AND startPos <= CHAR_LENGTH(subject) THEN
      SET result = CONCAT(result, RIGHT(subject, CHAR_LENGTH(subject) + 1 - startPos));
    ELSEIF startInc = -1 AND startPos >= 1 THEN
      SET result = CONCAT(LEFT(subject, startPos), result);
    END IF;
  ELSE
    SET result = subject;
  END IF;
END;//
DELIMITER ;

Upvotes: 4

Maksim.T
Maksim.T

Reputation: 531

I would just export table to CSV or something, then used notepad++/excel/etc. internal tool to replace
'.html' with '/'.

And then import back to SQL.

Also, since mysql supports regular expressions, you can search for your domain, containing .html with it.

 mydomain.de[^s]+.html

Upvotes: 1

ScaisEdge
ScaisEdge

Reputation: 133360

you can use like

 UPDATE tablename 
 SET contentfield = REPLACE(contentfield, '.html', '/') 
 where contentfield  like 'www.mydomain.de%' 
 AND contentfield  like '%html%'
 AND ( contentfield  not like 'www.another-domain.com/somelink%'  OR  
         contentfield  not like 'www.another-domain.com/subfolder/link%' )

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269693

Does this do what you want?

UPDATE tablename 
    SET contentfield = REPLACE(contentfield, '.html', '/') 
    WHERE contentfield like 'www.mydomain.de/%';

It should work for the examples in the question.

If you like, you can use the condition to only match rows that actually have ".html" in them.

    WHERE contentfield like 'www.mydomain.de/%.html%'

Upvotes: 2

Related Questions