Subha
Subha

Reputation: 761

Remove HTML tags from record

Need help to form the MYSQL query from table one column having the bellow content

Row1 : this is first <a href='mytext.txt'>row</a> from the table

Row 2 : THis is the second row <img src ='mytext.jpg'> my image is there

Row 3 : <p>This is the Third row my mytext is there </p>

Row 4 : <p class='te_mytext'>This is the Third row my text is there </p>

this is the table rows i try to search the keyword as 'mytext'

my query is

SELECT * from table  WHERE colmn_name ` like '%mytext%' "

I will get all the 4 rows as result but the result is wrong. I need to get the correct output as only Row 3. The reason this row only having the mytext inside the content all other are not in content but mytext having in all rows

How can I write the MySQL query?

Upvotes: 34

Views: 106091

Answers (10)

TarakPrajapati
TarakPrajapati

Reputation: 479

I think there is no need for such extraordinary logic. It can be simplified like...

-- set @StrRow = 'this is first <a href="mytext.txt">row</a> from the table';
-- set @StrRow = 'THis is the second row <img src ="mytext.jpg"> my image is there';
 set @StrRow = '<p>This is the Third row  my mytext is there  </p>';
-- set @StrRow = '<p class="te_mytext">This is the Third row  my text is there  </p>';

set @MyText = 'mytext';

select locate('<', @StrRow, locate(@MyText, @StrRow)) as '<', locate(@MyText, @StrRow) as MyText, locate('>', @StrRow, locate(@MyText, @StrRow)) as '>'
from xyz
 where
 locate('<', @StrRow, locate(@MyText, @StrRow)) > 0 and 
 locate('<', @StrRow, locate(@MyText, @StrRow)) < locate('>', @StrRow, locate(@MyText, @StrRow))

Upvotes: 2

Rembert Oldenboom
Rembert Oldenboom

Reputation: 21

Expanded the strip_tags function as provided by Boann. It now can be used to either keep or ignore the phrase between the tags. Note the bug with empty tags in my part of the code, ie. for $keep_phrase = false.

    CREATE FUNCTION strip_tags($str text, $tag text,$keep_phrase bool) RETURNS text
    BEGIN
        DECLARE $start, $end INT DEFAULT 1;
        SET $str = COALESCE($str, '');
        LOOP
            SET $start = LOCATE(CONCAT('<', $tag), $str, $start);
            IF (!$start) THEN RETURN $str; END IF;
            IF ($keep_phrase) THEN
                SET $end = LOCATE('>', $str, $start);
                IF (!$end) THEN SET $end = $start; END IF;
                SET $str = INSERT($str, $start, $end - $start + 1, '');
                SET $str = REPLACE($str, CONCAT('</', $tag, '>'), '');
            ELSE
                SET $end = LOCATE(CONCAT('</', $tag, '>'),$str,$start);
                IF (!$end) THEN 
                    SET $end = LOCATE('/>',$str,$start); 
                    SET $str = INSERT($str, $start, $end - $start + 2, '');
                ELSE 
                    SET $str = INSERT($str, $start, $end - $start 
                       + LENGTH(CONCAT('</', $tag, '>')), '');
                END IF;
            END IF;
        END LOOP;
    END //

To prove the functionality:

    SELECT strip_tags('<p>so<span id="x"> very</span> cool</p><p>so<span id="y"> very</span> cool</p>','span',true);
    <p>so very cool</p><p>so very cool</p>

    SELECT strip_tags('<p>so<span id="x"> very</span> cool</p><p>so<span id="y"> very</span> cool</p>','span',false);
    <p>so cool</p><p>so cool</p>

Empty elements with $keep_phrase = false is not supported, see:

    SELECT strip_tags('<p>so<span id="x"> very</span> cool</p><span/><p>so<span id="y"> very</span> cool</p>','span',false);
    <p>so cool</p> cool</p>

Upvotes: 2

user5779136
user5779136

Reputation:

Regular expression matching is needed here, with negative look-ahead assertion: "mytext" not followed by a closing tag. MySQL still, as of 8.0, doesn't support look assertions, but MariaDB does. MariaDB query for the question:

SELECT * FROM table WHERE column_name REGEXP 'mytext(?![^<>]*>)';

Another solution for the question is to remove some / all tags before matching. It's inefficient, compared with REGEXP, but also works. In MySQL starting with 8.0 and MariaDB starting with 10.0.5, there is built-in REGEXP_REPLACE function. 'strip_html' is even the first example in MariaDB corresponding documentation page. MySQL / MariaDB query for such approach:

SELECT * FROM table WHERE REGEXP_REPLACE (column_name, '<.+?>', '') LIKE '%mytext%';

Apart from that, strings in the question mix data and representation. They shouldn't be regularly searched as it is a waste of system resources.

Upvotes: 7

ajmedway
ajmedway

Reputation: 1492

MySQL strip tags implementation, allowing you to target a specific tag so that we can replace out the tags one by one with each function call. You just need pass the tag parameter, e.g. 'a' to replace out all opening/closing anchor tags.

# MySQL function to programmatically replace out specified html tags from text/html fields

# run this to drop/update the stored function
DROP FUNCTION IF EXISTS `strip_tags`;

DELIMITER |

# function to nuke all opening and closing tags of type specified in argument 2
CREATE FUNCTION `strip_tags`($str text, $tag text) RETURNS text
BEGIN
    DECLARE $start, $end INT DEFAULT 1;
    SET $str = COALESCE($str, '');
    LOOP
        SET $start = LOCATE(CONCAT('<', $tag), $str, $start);
        IF (!$start) THEN RETURN $str; END IF;
        SET $end = LOCATE('>', $str, $start);
        IF (!$end) THEN SET $end = $start; END IF;
        SET $str = INSERT($str, $start, $end - $start + 1, '');
        SET $str = REPLACE($str, CONCAT('</', $tag, '>'), '');
    END LOOP;
END;

| DELIMITER ;

# test select to nuke all opening <a> tags
SELECT 
    STRIP_TAGS(description, 'a') AS stripped
FROM
    tmpcat;

# run update query to replace out all <a> tags
UPDATE tmpcat
SET 
    description = STRIP_TAGS(description, 'a');

Upvotes: 2

cucu8
cucu8

Reputation: 931

I used strip_tags() above (Thanks Boann) with a slight mod to be rid of the html codes of accented chars etc. Like so:

...BEGIN
    DECLARE $start, $end INT DEFAULT 1;
    SET $str = REPLACE($str, "&nbsp;", " ");
    SET $str = REPLACE($str, "&euro;", "€");
    SET $str = REPLACE($str, "&aacute;", "á");
    SET $str = REPLACE($str, "&eacute;", "é");
    SET $str = REPLACE($str, "&iacute;", "í");
    SET $str = REPLACE($str, "&oacute;", "ó");
    SET $str = REPLACE($str, "&uacute;", "ú");
LOOP...

Upvotes: 2

ZAky
ZAky

Reputation: 31

Add these lines to fnStripTags function
after SET Dirty = Insert( Dirty, iStart, iLength, '');

set Dirty = Replace(Dirty,'&nbsp;',''); #No space between & and nbsp;
set Dirty = Replace(Dirty,'\r','');
set Dirty = Replace(Dirty,'\n','');

Upvotes: 3

Boann
Boann

Reputation: 50010

Here's my implementation of a strip_tags function:

CREATE FUNCTION `strip_tags`($str text) RETURNS text
BEGIN
    DECLARE $start, $end INT DEFAULT 1;
    LOOP
        SET $start = LOCATE("<", $str, $start);
        IF (!$start) THEN RETURN $str; END IF;
        SET $end = LOCATE(">", $str, $start);
        IF (!$end) THEN SET $end = $start; END IF;
        SET $str = INSERT($str, $start, $end - $start + 1, "");
    END LOOP;
END;

I made sure it removes mismatched opening brackets because they're dangerous, though it ignores any unpaired closing brackets because they're harmless.

mysql> select strip_tags('<span>hel<b>lo <a href="world">wo<>rld</a> <<x>again<.');
+----------------------------------------------------------------------+
| strip_tags('<span>hel<b>lo <a href="world">wo<>rld</a> <<x>again<.') |
+----------------------------------------------------------------------+
| hello world again.                                                   |
+----------------------------------------------------------------------+
1 row in set

Enjoy.

Upvotes: 30

Ryan Ou
Ryan Ou

Reputation: 331

If your content always start with tags(<body> etc.)

try this:

SELECT * from table  WHERE colmn_name REGEXP  '>[^<]*mytext';

Upvotes: 14

KB.
KB.

Reputation: 3657

try this solution: not tried it myself but apparently it works.

source: http://forums.mysql.com/read.php?52,177343,177985#msg-177985

   SET GLOBAL log_bin_trust_function_creators=1;
DROP FUNCTION IF EXISTS fnStripTags;
DELIMITER |
CREATE FUNCTION fnStripTags( Dirty varchar(4000) )
RETURNS varchar(4000)
DETERMINISTIC 
BEGIN
  DECLARE iStart, iEnd, iLength int;
    WHILE Locate( '<', Dirty ) > 0 And Locate( '>', Dirty, Locate( '<', Dirty )) > 0 DO
      BEGIN
        SET iStart = Locate( '<', Dirty ), iEnd = Locate( '>', Dirty, Locate('<', Dirty ));
        SET iLength = ( iEnd - iStart) + 1;
        IF iLength > 0 THEN
          BEGIN
            SET Dirty = Insert( Dirty, iStart, iLength, '');
          END;
        END IF;
      END;
    END WHILE;
    RETURN Dirty;
END;
|
DELIMITER ;
SELECT fnStripTags('this <html>is <b>a test</b>, nothing more</html>');

Upvotes: 55

Egor Pavlikhin
Egor Pavlikhin

Reputation: 17981

You cannot parse HTML inside of SQL queries, that doesn't make any sense. Perhaps you could just maintain a special search version of the table with all the HTML stripped, but you would have to use some external processing to do that.

Upvotes: -3

Related Questions