franco
franco

Reputation: 2047

Extracting text between html tags

I used postgres

I have this query

 SELECT 
 row_number() OVER (ORDER BY corresp.ID_CORRESP) as rNUM ,
 transfers.id_transfer AS TRANSFER_ID_TRANSFER, 
 corresp.id_corresp as  ID_CORRESP, 
 corresp.ORDERNBR_CORRESP as  ORDERNBR_CORRESP, 
 transfers.text_transfer AS TEXT 
FROM Transfers transfers 
 left outer join correspondence corresp  on corresp.id_corresp = transfers.id_corresp 
 left outer join tranf_corresp_tocc_employee on  tranf_corresp_tocc_employee.id_transfer = transfers.id_transfer
 left outer join employee on tranf_corresp_tocc_employee.id_employe = employee.id_employe 
 left outer join employee_lang on employee.id_employe = employee_lang.id_employe
 left outer join unit on employee.id_unit = unit.id_unit 
 left outer join unit_lang on unit_lang.id_unit =unit.id_unit
 left outer join action on action.id_action = transfers.id_action  
 left outer join action_lang on action_lang.id_action = action.id_action 
 WHERE transfers.status_transfer ='P' 

but the problem that transfers.text_transfer AS TEXT return this kind of result

<div align="right"><font color="3366FF"><b><font size="3">it's&nbsp;test</font></b></font></div>

I search the way to extract correct data from this result meaning extract it's test

so I want to add in my query same code to extract data from html tag , I think that I should use this function REGEXP_REPLACE

Updated :

when I try to run this query

CREATE LANGUAGE plperlu;

I have this error :

ERROR:  could not load library "C:/Program Files/PostgreSQL/9.2/lib/plperl.dll": %1 is not a valid Win32 application.


********** Error **********

ERROR: could not load library "C:/Program Files/PostgreSQL/9.2/lib/plperl.dll": %1 is not a valid Win32 application.
SQL state: 58P01

I have plperl.dll under C:/Program Files/PostgreSQL/9.2/lib

updated

I try to use another way with this example :

CREATE FUNCTION testFunction
(@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END

but I have this error :

ERROR:  syntax error at or near "@"
LINE 2: (@HTMLText VARCHAR(MAX))
         ^

********** Error **********

ERROR: syntax error at or near "@"
SQL state: 42601
Character: 31

Upvotes: 0

Views: 1311

Answers (1)

Craig Ringer
Craig Ringer

Reputation: 324375

If you want to do this in the database, use a PL/Perl, PL/Python, or similar tool that does proper HTML stripping.

For example, if you install HTML::Strip from CPAN or the libperl-html-strip (Debian/ubuntu) or perl-HTML-Strip (Fedora/RHEL) package:

CREATE LANGUAGE plperlu;

CREATE OR REPLACE FUNCTION striphtml(html text) RETURNS text
LANGUAGE plperlu
AS $$
use strict; use warnings; use 5.10.1;
use HTML::Strip;

my $hs = HTML::Strip->new(decode_entities => 1);
my $stripped = $hs->parse($_[0]);
$hs->eof;
return $stripped;
$$;

then:

regress=> SELECT striphtml('<div align="right"><font color="3366FF"><b><font size="3">it''s&nbsp;test</font></b></font></div>');
 striphtml 
-----------
 it's test
(1 row)

or you could use HTML::Parser to do it more cleanly.

There are many other alternatives. Pick an existing one and use it.

Upvotes: 1

Related Questions