Reputation: 825
As the title, if I insert some html into the table. How can I use mysql to extract the plaintext between the html tags?
For example:
CREATE TABLE `test`
(
`id` INT PRIMARY KEY,
`plain` LONGTEXT
)
INSERT INTO `text` VALUES (1, '<p>text between tag 'p'</p><span>text between 'span'</span>');
And before insert, I want to use a trigger to extract the plain text I want. How to?
Or use php ? I mean pass the html to a php script and the php script does the extraction.
Yet there's a solution for my problem by using php, how about the mysql one?
Upvotes: 0
Views: 10785
Reputation: 18435
If your content always start with tags( etc.)
try this:
SELECT * from table WHERE colmn_name REGEXP '>[^<]*mytext';
Another way is to use strip_tags
— Strip HTML and PHP tags from a string
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
Output of above code:
Test paragraph. Other text
<p>Test paragraph.</p> <a href="#fragment">Other text</a>
Warning:: Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected.
You should put the html code in a variable, let's say $html_input
$html_input= "'<p>text between tag 'p'</p><span>text between 'span'</span>'";
$stripped_html = strip_tags($html_input);
// Now insert it into the table `text`
INSERT INTO `text` VALUES (1, $striped_html);
Purely MYSQL
way:
CREATE FUNCTION `strip_tags`($str text) RETURNS text
BEGIN
DECLARE $start, $end INT DEFAULT 1;
LOOP
SET $start = LOCATE("<", $str, $start);
IF (!$start) THEN RETURN $str; END IF;
SET $end = LOCATE(">", $str, $start);
IF (!$end) THEN SET $end = $start; END IF;
SET $str = INSERT($str, $start, $end - $start + 1, "");
END LOOP;
END;
mysql> select strip_tags('<span>hel<b>lo <a href="world">wo<>rld</a> <<x>again<.');
+----------------------------------------------------------------------+
| strip_tags('<span>hel<b>lo <a href="world">wo<>rld</a> <<x>again<.') |
+----------------------------------------------------------------------+
| hello world again. |
+----------------------------------------------------------------------+
1 row in set
Reference: Stackoverflow
Upvotes: 1
Reputation: 8555
So I didn't like any of the existing solutions and I rolled my own that is ultra fast, removes HTML entities, too, and works on the MySQL server itself.
I've written this as a MySQL UDF in C so it can be as efficient as possible. I say "I've written" but it's basically just a mash of things that already existed, including the PHP striptags
and trim
functions straight from the PHP source code, and then the HTML entity decode function from here How to decode HTML Entities in C?.
The reason I didn't also port the PHP htmlspecialchars_decode
function was simply that is was far too complicated for what I needed, with a lot of extra code for handling things like Big5 (Chinese) character encoding. If you need it to be as robust as PHP's htmlspecialchars_decode
, then you might need to port it yourself, but so far my unhtml
MySQL extension has proven very powerful for my use case.
You can download it here, and the install instructions are part of the gist.
//make sure libmysqlclient-dev is installed:
apt-get install libmysqlclient-dev
// Replace "/usr/lib/mysql/plugin" with your MySQL plugins directory (can be found by running "select @@plugin_dir;")
gcc -I/usr/include/mysql -o unhtml.so -shared unhtml.c -fPIC && cp unhtml.so /usr/lib/mysql/plugin/unhtml.so
//Then, on the server:
//create function`unhtml`returns string soname'unhtml.so';
//And use/test like:
select `unhtml`('<b>I'll say, it works!</b>');
https://gist.github.com/BrianLeishman/da6edb36ea2989b2ded533087e9d1511
Upvotes: 0
Reputation: 1322
Updating with some modifications from one of the above answers.
DELIMITER ***
CREATE FUNCTION ConvertHTMLToText(str LONGTEXT CHARSET utf8)
RETURNS LONGTEXT CHARSET utf8
BEGIN
DECLARE start, end INT DEFAULT 1;
LOOP
SET start = LOCATE("<", str, start);
IF (!start) THEN RETURN str; END IF;
SET end = LOCATE(">", str, start);
IF (!end) THEN SET end = start; END IF;
SET str = TRIM(INSERT(str, start, end - start + 1, ""));
END LOOP;
END ***
Upvotes: 0
Reputation: 2603
You can do it in MySQL itself too. Youd need a equivalent of the striptags php Funktion. Take a look here
http://www.sitepoint.com/forums/showthread.php?656167-PHP-s-strip_tags()-equivalent-MYSQL-function
Upvotes: 0
Reputation: 76646
Use the strip_tags()
function on the string, store the result in a variable, and then use that in your query:
$str = "'<p>text between tag 'p'</p><span>text between 'span'</span>'";
$str = strip_tags($str); //'text between tag 'p'text between 'span''
$str = mysql_real_escape_string($str); //just for safety
In your query:
INSERT INTO `text` VALUES (1, $str);
Documention: strip_tags()
, mysql_real_escape_string()
Hope this helps!
Upvotes: 0