Reputation: 1371
I am stuck in a situation where I have a field in mysql which is a long html field. I need to extract the words between html tags.
Say,
<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>
I need something like this.
"My First Heading My first paragraph"
I am currently doing it in java from an exported csv file using a function like that of the following:
public String getStringFromHtml(String html) {
String nohtml = html.toString().replaceAll("<[^>]*>"," ");
return nohtml.trim().replaceAll("\\s+", " ");
}
But lets assume that I am just using Mysql workbench (no server side scripts) for some data analysis.
I was still wondering whether there are any ways that Mysql allow us to eliminate the html tags and extract the words in between. I tried to search all over stack overflow & google, I was not lucky as there is only advice to use it in PHP or java or stored procedures.
Still there is no way to extract html text using SQL ?
Upvotes: 1
Views: 3552
Reputation: 562260
You can use the ExtractValue() function to give an XPath expression that will pick out the part you need:
mysql> SELECT html FROM mytable;
+----------------------------------------------------------------------------------------------+
| html |
+----------------------------------------------------------------------------------------------+
| <!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html> |
+----------------------------------------------------------------------------------------------+
SELECT ExtractValue(html, '//html/body/p[1]') AS value FROM mytable;
+---------------------+
| value |
+---------------------+
| My first paragraph. |
+---------------------+
Upvotes: 1