Ggicci
Ggicci

Reputation: 825

MySQL extract plaintext from html data or PHP?

As the title, if I insert some html into the table. How can I use mysql to extract the plaintext between the html tags?
For example:

CREATE TABLE `test`
(
  `id`    INT PRIMARY KEY,
  `plain` LONGTEXT
)
INSERT INTO `text` VALUES (1, '<p>text between tag 'p'</p><span>text between 'span'</span>');

And before insert, I want to use a trigger to extract the plain text I want. How to?
Or use php ? I mean pass the html to a php script and the php script does the extraction.

update

Yet there's a solution for my problem by using php, how about the mysql one?

Upvotes: 0

Views: 10785

Answers (5)

softvar
softvar

Reputation: 18435

If your content always start with tags( etc.)

try this:

SELECT * from table  WHERE colmn_name REGEXP  '>[^<]*mytext'; 

Another way is to use strip_tags — Strip HTML and PHP tags from a string

<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
echo strip_tags($text);
echo "\n";

// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>

Output of above code:

Test paragraph. Other text
<p>Test paragraph.</p> <a href="#fragment">Other text</a>

Warning:: Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected.

You should put the html code in a variable, let's say $html_input

$html_input= "'<p>text between tag 'p'</p><span>text between 'span'</span>'";
$stripped_html = strip_tags($html_input);

// Now insert it into the table `text`
INSERT INTO `text` VALUES (1, $striped_html);

Purely MYSQL way:

CREATE FUNCTION `strip_tags`($str text) RETURNS text
BEGIN
    DECLARE $start, $end INT DEFAULT 1;
LOOP
    SET $start = LOCATE("<", $str, $start);
    IF (!$start) THEN RETURN $str; END IF;
    SET $end = LOCATE(">", $str, $start);
    IF (!$end) THEN SET $end = $start; END IF;
    SET $str = INSERT($str, $start, $end - $start + 1, "");
END LOOP;
END;

mysql> select strip_tags('<span>hel<b>lo <a href="world">wo<>rld</a> <<x>again<.');
+----------------------------------------------------------------------+
| strip_tags('<span>hel<b>lo <a href="world">wo<>rld</a> <<x>again<.') |
+----------------------------------------------------------------------+
| hello world again.                                                   |
+----------------------------------------------------------------------+
1 row in set

Reference: Stackoverflow

Upvotes: 1

Brian Leishman
Brian Leishman

Reputation: 8555

So I didn't like any of the existing solutions and I rolled my own that is ultra fast, removes HTML entities, too, and works on the MySQL server itself.

I've written this as a MySQL UDF in C so it can be as efficient as possible. I say "I've written" but it's basically just a mash of things that already existed, including the PHP striptags and trim functions straight from the PHP source code, and then the HTML entity decode function from here How to decode HTML Entities in C?.

The reason I didn't also port the PHP htmlspecialchars_decode function was simply that is was far too complicated for what I needed, with a lot of extra code for handling things like Big5 (Chinese) character encoding. If you need it to be as robust as PHP's htmlspecialchars_decode, then you might need to port it yourself, but so far my unhtml MySQL extension has proven very powerful for my use case.

You can download it here, and the install instructions are part of the gist.

//make sure libmysqlclient-dev is installed:
apt-get install libmysqlclient-dev

// Replace "/usr/lib/mysql/plugin" with your MySQL plugins directory (can be found by running "select @@plugin_dir;")
gcc -I/usr/include/mysql -o unhtml.so -shared unhtml.c -fPIC && cp unhtml.so /usr/lib/mysql/plugin/unhtml.so

//Then, on the server:
//create function`unhtml`returns string soname'unhtml.so';

//And use/test like:
select `unhtml`('<b>I&apos;ll say, it works!</b>');

https://gist.github.com/BrianLeishman/da6edb36ea2989b2ded533087e9d1511

Upvotes: 0

Pankaj Garg
Pankaj Garg

Reputation: 1322

Updating with some modifications from one of the above answers.

DELIMITER ***
                                CREATE FUNCTION ConvertHTMLToText(str LONGTEXT CHARSET utf8)
                                RETURNS LONGTEXT CHARSET utf8
                                BEGIN
                                DECLARE start, end INT DEFAULT 1;
                                    LOOP
                                    SET start = LOCATE("<", str, start);
                                    IF (!start) THEN RETURN str; END IF;
                                    SET end = LOCATE(">", str, start);
                                    IF (!end) THEN SET end = start; END IF;
                                    SET str = TRIM(INSERT(str, start, end - start + 1, ""));
                                END LOOP;
                            END ***

Upvotes: 0

zewa666
zewa666

Reputation: 2603

You can do it in MySQL itself too. Youd need a equivalent of the striptags php Funktion. Take a look here

http://www.sitepoint.com/forums/showthread.php?656167-PHP-s-strip_tags()-equivalent-MYSQL-function

Upvotes: 0

Amal Murali
Amal Murali

Reputation: 76646

Use the strip_tags() function on the string, store the result in a variable, and then use that in your query:

$str = "'<p>text between tag 'p'</p><span>text between 'span'</span>'";
$str = strip_tags($str); //'text between tag 'p'text between 'span''
$str = mysql_real_escape_string($str); //just for safety

In your query:

INSERT INTO `text` VALUES (1, $str);

Documention: strip_tags(), mysql_real_escape_string()

Hope this helps!

Upvotes: 0

Related Questions