Reputation: 670
In my database there are contenfields with lots of internal links. I have to change link structure from www.mydomain.de/page.html to www.mydomain.de/page/, but replace statement should respect the domain:
This is whats expected to be replaced:
www.mydomain.de/somepage.html -> www.mydomain.de/page/
www.mydomain.de/subfolder/page.html -> www.mydomain.de/subfolder/page/
www.mydomain.de/link.html?param=1 -> www.mydomain.de/page/?param=1
www.mydomain.de/another-link.html#hash -> www.mydomain.de/page/#hash
All other links should be untouched, here some examples, but could be any link on the web:
www.some-domain.de/link.html
www.another-domain.com/somelink.html
There can be different links in one contentfield:
<p>If you want to read more, click
<a href="http://www.mydomain.de/page.html">here</a>
or there <a href="http://www.another-domain.com/somelink.html">there</a>
This is doing the replace:
UPDATE tablename
SET contentfield = REPLACE(contentfield, '.html', '/')
My ideas (but don't know how to create a statement for them):
It does not have to be 100% matching all 'mydomain.de' links, I am happy with 90%, but there should not be a wrong replacement in external links.
Upvotes: 3
Views: 1327
Reputation: 39404
UPDATE: Have now made this into a blog post: http://stevettt.blogspot.co.uk/2018/02/a-mysql-regular-expression-replace.html
Please see the following Rextester Fiddle, which I think should produce all the results you have asked for:
Explanation
A pattern-replacing function is needed for this but unfortunately MySQL doesn't provide such a thing. So I wrote one (based on another that wasn't quite sufficient) and have posted it here. As mentioned in the referenced answer, this function has a limitation of not allowing substitution of capturing groups with backreferences. Hence it has been slightly adapted in the fiddle to take further parameters that allow it to perform a recursive replace in the found match for the replacement. (Note the use of allowed URL path characters in the regex as per this excellent answer).
Update SQL
The following SQL will update the table data using the function below:
UPDATE urls
SET url = reg_replace(
url,
'www\\.mydomain\\.de/[-A-Za-z0-9\\._~!\\$&''\\(\\)\\*\\+,;=:@%/]+\\.html',
'/[^/]+\\.html',
'/page/',
TRUE,
22, -- Min match length = www.mydomain.de/?.html = 22
0, -- No max match length
7, -- Min sub-match length = /?.html = 7
0 -- No max sub-match length
);
Function code
The UDF code used in the demo is also posted below. Note: The UDF delegates to a stored procedure since only stored procedures will allow recursion in MySQL.
-- ------------------------------------------------------------------------------------
-- USAGE
-- ------------------------------------------------------------------------------------
-- SELECT reg_replace(<subject>,
-- <pattern>,
-- <subpattern>,
-- <replacement>,
-- <greedy>,
-- <minMatchLen>,
-- <maxMatchLen>,
-- <minSubMatchLen>,
-- <maxSubMatchLen>);
-- where:
-- <subject> is the string to look in for doing the replacements
-- <pattern> is the regular expression to match against
-- <subpattern> is a regular expression to match against within each
-- portion of text that matches <pattern>
-- <replacement> is the replacement string
-- <greedy> is TRUE for greedy matching or FALSE for non-greedy matching
-- <minMatchLen> specifies the minimum match length
-- <maxMatchLen> specifies the maximum match length
-- <minSubMatchLen> specifies the minimum match length
-- <maxSubMatchLen> specifies the maximum match length
-- (minMatchLen, maxMatchLen, minSubMatchLen and maxSubMatchLen are used to improve
-- efficiency but are optional and can be set to 0 or NULL if not known/required)
-- Example:
-- SELECT reg_replace(txt, '[A-Z0-9]{3}', '[0-9]', '_', TRUE, 3, 3, 1, 1) FROM tbl;
DROP FUNCTION IF EXISTS reg_replace;
DELIMITER //
CREATE FUNCTION reg_replace(subject VARCHAR(21845), pattern VARCHAR(21845),
subpattern VARCHAR(21845), replacement VARCHAR(21845), greedy BOOLEAN,
minMatchLen INT, maxMatchLen INT, minSubMatchLen INT, maxSubMatchLen INT)
RETURNS VARCHAR(21845) DETERMINISTIC BEGIN
DECLARE result VARCHAR(21845);
CALL reg_replace_worker(
subject, pattern, subpattern, replacement, greedy, minMatchLen, maxMatchLen,
minSubMatchLen, maxSubMatchLen, result);
RETURN result;
END;//
DELIMITER ;
DROP PROCEDURE IF EXISTS reg_replace_worker;
DELIMITER //
CREATE PROCEDURE reg_replace_worker(subject VARCHAR(21845), pattern VARCHAR(21845),
subpattern VARCHAR(21845), replacement VARCHAR(21845), greedy BOOLEAN,
minMatchLen INT, maxMatchLen INT, minSubMatchLen INT, maxSubMatchLen INT,
OUT result VARCHAR(21845))
BEGIN
DECLARE subStr, usePattern, useRepl VARCHAR(21845);
DECLARE startPos, prevStartPos, startInc, len, lenInc INT;
SET @@SESSION.max_sp_recursion_depth = 2;
IF subject REGEXP pattern THEN
SET result = '';
-- Sanitize input parameter values
SET minMatchLen = IF(minMatchLen < 1, 1, minMatchLen);
SET maxMatchLen = IF(maxMatchLen < 1 OR maxMatchLen > CHAR_LENGTH(subject),
CHAR_LENGTH(subject), maxMatchLen);
-- Set the pattern to use to match an entire string rather than part of a string
SET usePattern = IF (LEFT(pattern, 1) = '^', pattern, CONCAT('^', pattern));
SET usePattern = IF (RIGHT(pattern, 1) = '$', usePattern, CONCAT(usePattern, '$'));
-- Set start position to 1 if pattern starts with ^ or doesn't end with $.
IF LEFT(pattern, 1) = '^' OR RIGHT(pattern, 1) <> '$' THEN
SET startPos = 1, startInc = 1;
-- Otherwise (i.e. pattern ends with $ but doesn't start with ^): Set start pos
-- to the min or max match length from the end (depending on "greedy" flag).
ELSEIF greedy THEN
SET startPos = CHAR_LENGTH(subject) - maxMatchLen + 1, startInc = 1;
ELSE
SET startPos = CHAR_LENGTH(subject) - minMatchLen + 1, startInc = -1;
END IF;
WHILE startPos >= 1 AND startPos <= CHAR_LENGTH(subject)
AND startPos + minMatchLen - 1 <= CHAR_LENGTH(subject)
AND !(LEFT(pattern, 1) = '^' AND startPos <> 1)
AND !(RIGHT(pattern, 1) = '$'
AND startPos + maxMatchLen - 1 < CHAR_LENGTH(subject)) DO
-- Set start length to maximum if matching greedily or pattern ends with $.
-- Otherwise set starting length to the minimum match length.
IF greedy OR RIGHT(pattern, 1) = '$' THEN
SET len = LEAST(CHAR_LENGTH(subject) - startPos + 1, maxMatchLen), lenInc = -1;
ELSE
SET len = minMatchLen, lenInc = 1;
END IF;
SET prevStartPos = startPos;
lenLoop: WHILE len >= 1 AND len <= maxMatchLen
AND startPos + len - 1 <= CHAR_LENGTH(subject)
AND !(RIGHT(pattern, 1) = '$'
AND startPos + len - 1 <> CHAR_LENGTH(subject)) DO
SET subStr = SUBSTRING(subject, startPos, len);
IF subStr REGEXP usePattern THEN
IF subpattern IS NULL THEN
SET useRepl = replacement;
ELSE
CALL reg_replace_worker(subStr, subpattern, NULL, replacement, greedy,
minSubMatchLen, maxSubMatchLen, NULL, NULL, useRepl);
END IF;
SET result = IF(startInc = 1,
CONCAT(result, useRepl), CONCAT(useRepl, result));
SET startPos = startPos + startInc * len;
LEAVE lenLoop;
END IF;
SET len = len + lenInc;
END WHILE;
IF (startPos = prevStartPos) THEN
SET result = IF(startInc = 1, CONCAT(result, SUBSTRING(subject, startPos, 1)),
CONCAT(SUBSTRING(subject, startPos, 1), result));
SET startPos = startPos + startInc;
END IF;
END WHILE;
IF startInc = 1 AND startPos <= CHAR_LENGTH(subject) THEN
SET result = CONCAT(result, RIGHT(subject, CHAR_LENGTH(subject) + 1 - startPos));
ELSEIF startInc = -1 AND startPos >= 1 THEN
SET result = CONCAT(LEFT(subject, startPos), result);
END IF;
ELSE
SET result = subject;
END IF;
END;//
DELIMITER ;
Upvotes: 4
Reputation: 531
I would just export table to CSV or something, then used notepad++/excel/etc. internal tool to replace
'.html' with '/'.
And then import back to SQL.
Also, since mysql supports regular expressions, you can search for your domain, containing .html with it.
mydomain.de[^s]+.html
Upvotes: 1
Reputation: 133360
you can use like
UPDATE tablename
SET contentfield = REPLACE(contentfield, '.html', '/')
where contentfield like 'www.mydomain.de%'
AND contentfield like '%html%'
AND ( contentfield not like 'www.another-domain.com/somelink%' OR
contentfield not like 'www.another-domain.com/subfolder/link%' )
Upvotes: 0
Reputation: 1269693
Does this do what you want?
UPDATE tablename
SET contentfield = REPLACE(contentfield, '.html', '/')
WHERE contentfield like 'www.mydomain.de/%';
It should work for the examples in the question.
If you like, you can use the condition to only match rows that actually have ".html" in them.
WHERE contentfield like 'www.mydomain.de/%.html%'
Upvotes: 2