How can I parse only the text from a single HTML line?

Question

I have this line:

From this line I need to get only the Hebrew words. I want to remove all the tags, along with the onmouseover, tooltip and void parts, and be left with only the Hebrew words and this part: בתאריך: 22.07.14 שעה: 08:56

Or in this case:

Again, I want to be left with all the Hebrew words and: מתאריך 17.07.14 בשעה 23:20

How can I do it?

I have this method that I used to parse text:

using System.Collections.Generic;

public List<string> CreateTextList(string filePath)
{
    List<string> text = new List<string>();
    var htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    // Code page 65001 is UTF-8.
    htmlDoc.Load(filePath, System.Text.Encoding.GetEncoding(65001));

    if (htmlDoc.DocumentNode != null)
    {
        // SelectNodes returns null when nothing matches the XPath.
        var nodes = htmlDoc.DocumentNode.SelectNodes("//a/b");
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                text.Add(node.InnerText);
            }
        }
    }
    text = Filters.filterNumbers(text);
    return text;
}

It works well, but it takes a file, not a single line of text.

MaPi · Accepted Answer

Well, you can't use an XML parser if you work with lines (you can't traverse the XML tree structure if you don't have the whole structure).

But as suggested here: https://stackoverflow.com/a/19524158/1648371

You can use

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();
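As a rough illustration of the one-liner above, here is a minimal sketch wrapped in a helper. The class and method names (`HtmlLineStripper`, `StripTags`) are made up for this example, and it assumes the second alternative in the pattern is the `&nbsp;` entity, as in the linked answer. It also assumes no attribute value contains a `>` character, since the pattern would cut such a tag short.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question or answer.
public static class HtmlLineStripper
{
    // Deletes every tag (a '<' followed by anything up to the next '>')
    // and every literal "&nbsp;" entity, then trims the ends.
    public static string StripTags(string inputHtml)
    {
        return Regex.Replace(inputHtml, @"<[^>]+>|&nbsp;", "").Trim();
    }
}
```

Calling `HtmlLineStripper.StripTags(line)` on one line of markup should leave only the text between the tags, including the Hebrew words and the date and time, because the pattern only touches tags and the `&nbsp;` entity.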

To retrieve the individual strings, instead of replacing the HTML tags with an empty string, you can replace them with a special character that you won't have in your input (like the Swedish letter å) and then

Regex.Matches(noHTML, "å", RegexOptions.IgnoreCase);
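One possible way to turn the placeholder idea into separate strings is sketched below. Note that it swaps `Regex.Matches` for `String.Split`, since splitting on the placeholder yields the text pieces directly, while matching on it only locates the separators. The names `HtmlLineSplitter` and `ExtractTextPieces` are hypothetical, and the sketch assumes that å never occurs in the real input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class HtmlLineSplitter
{
    // Placeholder assumed never to appear in the input text.
    private const string Separator = "å";

    public static string[] ExtractTextPieces(string inputHtml)
    {
        // Replace each tag or "&nbsp;" with the placeholder instead of deleting it.
        string marked = Regex.Replace(inputHtml, @"<[^>]+>|&nbsp;", Separator);

        // Split on the placeholder, trim each piece, and drop the empty ones.
        return marked
            .Split(new[] { Separator }, StringSplitOptions.RemoveEmptyEntries)
            .Select(piece => piece.Trim())
            .Where(piece => piece.Length > 0)
            .ToArray();
    }
}
```

Each element of the returned array is one run of text that sat between two tags, so the Hebrew words and the date and time come back as separate entries.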
