Mojtaba
Mojtaba

Reputation: 93

Searching for a text in a string property where the value is HTML

I have a field in my database (SQL Server) with the type of nvarchar(max) where its values are HTML (say an email's content).

I want to search through these values by a given text. I use .net core and Entity framework to make queries.

Of course, I can simply use Contains function but that also includes HTML tags. For example, if I search for the text "strong", it also matches those values that have strong HTML tag though they might not have the word strong in their content, which is not desirable.

var query = _dbContext.Emails.AsQueryable();

// The following code also searches through HTML tags
if(!string.IsNullOrWhiteSpace(searchText))
{
    query=query.Where(ent=>ent.Contents.Contains(searchText));
}

Is there an easy way to avoid HTML tags in the search, for example using Regex?

UPDATE

Ideally I want to do this in SQL server side when the query is IQueryable not IEnumerable, i.e., before calling .ToList().

Upvotes: 2

Views: 841

Answers (4)

Mojtaba
Mojtaba

Reputation: 93

Thanks to @Amine and @lollmbaowtfidgafgtfoohwtbs, I figured out how to do this.

First, I created a SQL function in my database that strips a given text:


CREATE FUNCTION [dbo].[ufnStripHTML] (@HTMLText NVARCHAR(MAX))
RETURNS NVARCHAR(MAX) AS
BEGIN
    DECLARE @Start INT
    DECLARE @End INT
    DECLARE @Length INT
    SET @Start = CHARINDEX('<',@HTMLText)
    SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
    SET @Length = (@End - @Start) + 1
    WHILE @Start > 0 AND @End > 0 AND @Length > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
        SET @Start = CHARINDEX('<',@HTMLText)
        SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
        SET @Length = (@End - @Start) + 1
    END
    RETURN LTRIM(RTRIM(@HTMLText))
END
GO

Then I added a reference to that function in my DbContext:

        [DbFunction("ufnStripHTML")]
        public static string StripHTML(string text)
        {
            throw new Exception("not implemented");
        }

and now I can use it in my Linq to SQL query:


if (!string.IsNullOrWhiteSpace(searchText))
{
    query = query.Where(ent => TGDbContext.StripHTML(ent.Contents).Contains(searchText));
}

Upvotes: 1

aniket
aniket

Reputation: 21

See if you can use free and open source HtmlAgilityPack through which you can first convert the html text to plain text and then apply the search criteria.

E.g.: var plainTextResult = HtmlUtilities.ConvertToPlainText(string html);

if(!string.IsNullOrWhiteSpace(searchText))
{
    bool containsResult = plainTextResult.Contains(searchText);
}

Upvotes: 1

Mahesh S
Mahesh S

Reputation: 395

You can use the regex function as am using below:

"<div>Hello</div><span>World</span>".replace(/<[^>]*>/g, '')

Upvotes: -1

Guilhem Prev
Guilhem Prev

Reputation: 979

I think there is no easy answer in your case, but you have several possibilities. Regex can be a simple solution if you do not want to get the full tags. A more deeper but more complex would be to use a package like HtmlAgilityPack to parse your mail.

Here is an example of the regex :

var searchWord = "strong";
var mail = "<strong>blablabla</strong><p>blabla strong blabla</p>";
var rgx = new Regex($"(?!<){searchWord}(?!>)"); // Will match strong but not <strong> or <strong or strong>
if (rgx.IsMatch(mail))
{
    // Do what you want
}

Upvotes: 1

Related Questions