Reputation: 1519
I have a problem cleaning up a string with Regex. I wrote this function:
private String parseAnswer(String res)
{
    String[] pattern = new String[16] {
        // Elements matched together with their content
        "<head[^>]*?>.*?</head>", "<style[^>]*?>.*?</style>", "<script[^>]*?.*?</script>",
        "<object[^>]*?.*?</object>", "<embed[^>]*?.*?</embed>", "<applet[^>]*?.*?</applet>",
        "<noframes[^>]*?.*?</noframes>", "<noscript[^>]*?.*?</noscript>", "<noembed[^>]*?.*?</noembed>",
        // Block-level tags that only get a newline in front of them
        "</?((address)|(blockquote)|(center)|(del))",
        "</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))",
        "</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))",
        "</?((table)|(th)|(td)|(caption))",
        "</?((form)|(button)|(fieldset)|(legend)|(input))",
        "</?((label)|(select)|(optgroup)|(option)|(textarea))",
        "</?((frameset)|(frame)|(iframe))"
    };
    String[] replacement = new String[16] {
        " ", " ", " ", " ", " ", " ", " ", " ", " ",
        "\n$0", "\n$0", "\n$0", "\n$0", "\n$0", "\n$0", "\n$0"
    };
    for (int i = 0; i < pattern.Length; i++)
    {
        res = Regex.Replace(res, pattern[i], replacement[i]);
    }
    return res;
}
This function takes HTML source as input. I want to strip some of the HTML tags, so I prepared an array of patterns, but the function doesn't seem to clean the HTML. The patterns list the HTML tags I want to remove. Some tags I don't remove; I only add \n in front of them.
Can you help me with this Regex, or suggest a library for this task? My aim is to remove the HTML tags so that only the text of the website is left to parse.
EDIT: OK, I can use HTMLAgilityPack, but I have a few questions. With htmlDoc.LoadHtml(URL); I need to convert the result to UTF-8. Does HTMLAgilityPack have a function for that? Second, I want to put the InnerText result into JSON and send it to JavaScript. How can I remove the characters that are not allowed in JavaScript?
Upvotes: 0
Views: 1759
Reputation: 499382
Regex tends to be a poor choice for parsing HTML, particularly HTML from different sources.
I suggest using a purpose-built parser like the HTML Agility Pack instead:
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
The source download comes with a number of example projects that document how to use the library for different tasks.
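As a rough sketch of how this could look with the Html Agility Pack, assuming the HtmlAgilityPack NuGet package and System.Text.Json are available: the names HtmlText, Extract and ToJson are made up for illustration, and the list of removed elements simply mirrors the first nine patterns in the question. Note that LoadHtml expects the markup itself rather than a URL; HtmlWeb.Load can fetch a page by URL.

```csharp
using System;
using System.Linq;
using System.Text.Json;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is referenced

// Hypothetical helper class, not part of the Html Agility Pack.
public static class HtmlText
{
    // Elements dropped together with their content (taken from the question).
    private const string UnwantedXPath =
        "//head|//style|//script|//object|//embed|//applet|//noframes|//noscript|//noembed";

    // Returns the visible text of an HTML string.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // takes markup, not a URL

        // SelectNodes returns null when nothing matches.
        var unwanted = doc.DocumentNode.SelectNodes(UnwantedXPath);
        if (unwanted != null)
        {
            // Copy to a list so removal does not interfere with enumeration.
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        // InnerText still contains entities such as &amp;, so decode them.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Wraps the text in a JSON object; the serializer does the escaping.
    public static string ToJson(string text)
    {
        return JsonSerializer.Serialize(new { text });
    }
}
```

Something like HtmlText.ToJson(HtmlText.Extract(html)) then gives a string that JavaScript can read with JSON.parse. The serializer escapes quotes, control characters and, by default, non-ASCII and HTML-sensitive characters, so there should be no need to strip "forbidden" characters by hand. One limitation of this sketch: InnerText joins the text of adjacent elements without adding line breaks, so the newline handling from the question would need extra work.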
Upvotes: 6