Extract HTML text using Mysql

Question

I am stuck in a situation where I have a field in mysql which is a long html field. I need to extract the words between html tags.

Say,

`My First Heading`

My first paragraph.

I need something like this.

"My First Heading My first paragraph"

I am currently doing it in java from an exported csv file using a function like that of the following:

public String getStringFromHtml(String html) {
         String nohtml = html.toString().replaceAll("<[^>]*>"," ");
         return nohtml.trim().replaceAll("\s+", " ");
}

But lets assume that I am just using Mysql workbench (no server side scripts) for some data analysis.

I was still wondering whether there are any ways that Mysql allow us to eliminate the html tags and extract the words in between. I tried to search all over stack overflow & google, I was not lucky as there is only advice to use it in PHP or java or stored procedures.

Still there is no way to extract html text using SQL ?

Bill Karwin · Accepted Answer

You can use the ExtractValue() function to give an XPath expression that will pick out the part you need:

mysql> SELECT html FROM mytable;
+----------------------------------------------------------------------------------------------+
| html                                                                                         |
+----------------------------------------------------------------------------------------------+
| My First Heading
My first paragraph. |
+----------------------------------------------------------------------------------------------+

SELECT ExtractValue(html, '//html/body/p[1]') AS value FROM mytable;
+---------------------+
| value               |
+---------------------+
| My first paragraph. |
+---------------------+

Extract HTML text using Mysql

Answers (1)

Related Questions