Reputation: 93
I have a field in my database (SQL Server) with the type of nvarchar(max) where its values are HTML (say an email's content).
I want to search through these values by a given text. I use .net core
and Entity framework
to make queries.
Of course, I can simply use Contains
function but that also includes HTML tags. For example, if I search for the text "strong", it also matches those values that have strong HTML tag though they might not have the word strong in their content, which is not desirable.
var query = _dbContext.Emails.AsQueryable();
// The following code also searches through HTML tags
if(!string.IsNullOrWhiteSpace(searchText))
{
query=query.Where(ent=>ent.Contents.Contains(searchText));
}
Is there an easy way to avoid HTML tags in the search, for example using Regex?
UPDATE
Ideally I want to do this in SQL server side when the query is IQueryable
not IEnumerable
, i.e., before calling .ToList()
.
Upvotes: 2
Views: 841
Reputation: 93
Thanks to @Amine and @lollmbaowtfidgafgtfoohwtbs, I figured out how to do this.
First, I created a SQL function in my database that strips a given text:
CREATE FUNCTION [dbo].[ufnStripHTML] (@HTMLText NVARCHAR(MAX))
RETURNS NVARCHAR(MAX) AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0 AND @End > 0 AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
GO
Then I added a reference to that function in my DbContext
:
[DbFunction("ufnStripHTML")]
public static string StripHTML(string text)
{
throw new Exception("not implemented");
}
and now I can use it in my Linq to SQL
query:
if (!string.IsNullOrWhiteSpace(searchText))
{
query = query.Where(ent => TGDbContext.StripHTML(ent.Contents).Contains(searchText));
}
Upvotes: 1
Reputation: 21
See if you can use free and open source HtmlAgilityPack through which you can first convert the html text to plain text and then apply the search criteria.
E.g.: var plainTextResult = HtmlUtilities.ConvertToPlainText(string html);
if(!string.IsNullOrWhiteSpace(searchText))
{
bool containsResult = plainTextResult.Contains(searchText);
}
Upvotes: 1
Reputation: 395
You can use the regex function as am using below:
"<div>Hello</div><span>World</span>".replace(/<[^>]*>/g, '')
Upvotes: -1
Reputation: 979
I think there is no easy answer in your case, but you have several possibilities. Regex can be a simple solution if you do not want to get the full tags. A more deeper but more complex would be to use a package like HtmlAgilityPack to parse your mail.
Here is an example of the regex :
var searchWord = "strong";
var mail = "<strong>blablabla</strong><p>blabla strong blabla</p>";
var rgx = new Regex($"(?!<){searchWord}(?!>)"); // Will match strong but not <strong> or <strong or strong>
if (rgx.IsMatch(mail))
{
// Do what you want
}
Upvotes: 1