Logan
Logan

Reputation: 1371

Extract HTML text using Mysql

I am stuck in a situation where I have a field in mysql which is a long html field. I need to extract the words between html tags.

Say,

<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>

I need something like this.

"My First Heading My first paragraph"

I am currently doing it in java from an exported csv file using a function like that of the following:

public String getStringFromHtml(String html) {
         String nohtml = html.toString().replaceAll("<[^>]*>"," ");
         return nohtml.trim().replaceAll("\\s+", " ");
}

But lets assume that I am just using Mysql workbench (no server side scripts) for some data analysis.

I was still wondering whether there are any ways that Mysql allow us to eliminate the html tags and extract the words in between. I tried to search all over stack overflow & google, I was not lucky as there is only advice to use it in PHP or java or stored procedures.

Still there is no way to extract html text using SQL ?

Upvotes: 1

Views: 3552

Answers (1)

Bill Karwin
Bill Karwin

Reputation: 562260

You can use the ExtractValue() function to give an XPath expression that will pick out the part you need:

mysql> SELECT html FROM mytable;
+----------------------------------------------------------------------------------------------+
| html                                                                                         |
+----------------------------------------------------------------------------------------------+
| <!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html> |
+----------------------------------------------------------------------------------------------+

SELECT ExtractValue(html, '//html/body/p[1]') AS value FROM mytable;
+---------------------+
| value               |
+---------------------+
| My first paragraph. |
+---------------------+

Upvotes: 1

Related Questions