PsychoX
PsychoX

Reputation: 1098

Clean database from HTML/ MS Word tags

I have very large database to cleanup. On the old system user was permitted to use HTML (and paste from Word files). Now I've large amount of open tags in different places in DB.

Anyone know a application/script to perform this kind of cleanup?

Upvotes: 3

Views: 2499

Answers (2)

Branislav
Branislav

Reputation: 315

What if you have < and > characters as a part of the text, not a part of HTML. Even if your string contains broken HTML tag at the end of the string. So then you should use this version of the function:

SET GLOBAL log_bin_trust_function_creators=1;
DROP FUNCTION IF EXISTS fnStripTags;
DELIMITER |
CREATE FUNCTION fnStripTags( Dirty varchar(4096) )
RETURNS varchar(4096)
DETERMINISTIC 
BEGIN
  DECLARE iStart, iEnd, iLength, DirtyLength, DirtyTail int;
  DECLARE iNextLessThenAngleBracket, iLengthToNextLessThenAngleBracket int;
    WHILE Locate( '<', Dirty ) > 0 And Locate( '>', Dirty, Locate( '<', Dirty )) > 0 DO
      BEGIN
        SET iStart = Locate( '<', Dirty ), iEnd = Locate( '>', Dirty, Locate('<', Dirty ));
        SET iNextLessThenAngleBracket = Locate( '<', Dirty, Locate('<', Dirty ) + 1);
        SET iLength = ( iEnd - iStart) + 1;
        SET iLengthToNextLessThenAngleBracket = ( iNextLessThenAngleBracket - iStart) + 1;

        IF iLength < iLengthToNextLessThenAngleBracket THEN
            BEGIN
            IF iLength > 0 THEN
              BEGIN
                SET Dirty = Insert( Dirty, iStart, iLength, '');
              END;
            END IF;
            END;
        ELSE
            BEGIN
                IF iNextLessThenAngleBracket != 0 THEN
                    BEGIN
                        SET Dirty = Insert( Dirty, iStart, 1, '*/*');
                    END;
                ELSE
                    BEGIN
                        SET Dirty = Insert( Dirty, iStart, iLength, '');
                    END;
                END IF;
            END;
        END IF;

      END;
    END WHILE;
    IF Locate( '<', Dirty ) > 0 THEN
        BEGIN
            SET DirtyLength = CHAR_LENGTH(Dirty);
            SET DirtyTail = DirtyLength - Locate( '<', Dirty ) + 1;
            SET Dirty = Insert( Dirty, Locate( '<', Dirty ), DirtyTail, '');
        END;
    END IF;
    BEGIN
        SET Dirty = REPLACE(Dirty, '*/*', '<');
    END;
    RETURN Dirty;
END;
|
DELIMITER ;

Upvotes: 0

Ilia Sachev
Ilia Sachev

Reputation: 76

You may do this with some bits of php (or another backend language)

Here is an example written on php

$link = mysql_connect($host, $username, $password); //connecting to database 
mysql_select_db($database_name,$link); // select a database
$q = mysql_query('SELECT id_row,cell_with_tags FROM tablename WHERE some_condition_if_it_needed'); 
if(mysql_num_rows($q) !== 0){   //check if result of our query not empty
    while($res = mysql_fetch_assoc($q)){ // gets all rows-cells as associative array
         $result[] = array('id_row'=>$res['id_row'],
                           'cell_with_tags'=>strip_tags($res['cell_with_tags']);
    }
}
if(!empty($result)){ 
    foreach($result as $k=>&$v){ // update our database. Write there values without tags
         mysql_query('UPDATE tablename SET cell_with_tags = '.$result[$k]['cell_with_tags'].' WHERE id='.$result[$k]['id_row']);
    }
}

You also can modify call of strip_tags function, if you want to remove not all html tags. (http://ru.php.net/manual/en/function.strip-tags.php)

If executing of php script is not possible, you can do this only with mysql. I get it from this

DROP FUNCTION IF EXISTS strip_tags || 
CREATE FUNCTION strip_tags( x longtext) RETURNS longtext
LANGUAGE SQL NOT DETERMINISTIC READS SQL DATA 
BEGIN
   DECLARE sstart INT UNSIGNED;
   DECLARE ends INT UNSIGNED;
   SET sstart = LOCATE('<', x, 1);
   REPEAT
      SET ends = LOCATE('>', x, sstart);
      SET x = CONCAT(SUBSTRING( x, 1 ,sstart -1) ,SUBSTRING(x, ends +1 )) ; 
      SET sstart = LOCATE('<', x, 1);
   UNTIL sstart < 1 END REPEAT;
return x;
END;
||
delimiter ;

mysql> SELECT strip_tags('<a href="HelloWorld.html"><B>Hi, mate!</B></a>') as strip_tags;

Upvotes: 2

Related Questions