Reputation: 33186
I am trying to extract the JavaScript code from an HTML file using C# and regular expressions. The code I use now is the following:
string js = Regex.Replace(code, @"^.*?\<script\s?.*?\>((.|\r\n)+?)\<\/script\>.*$", "$1", RegexOptions.Multiline);
But when I use this, I get the full HTML code back with only the script tags stripped.
Can someone help me with this?
Edit: I now use the Html Agility Pack with the following code:
var hwObject = new HtmlWeb();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(code);
foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
{
    string js = script.InnerText;
    HtmlTextNode text = (HtmlTextNode)script.ChildNodes.Single(d => d.NodeType == HtmlNodeType.Text);
    text.Text = TrimJs(js);
}
But only the last script tag gets replaced. The scripts before it just disappear.
Upvotes: 3
Views: 2986
Reputation: 6515
You should take a look at the Html Agility Pack.
It is generally much easier to process HTML with a DOM-style parser than with regular expressions.
You could use something like this:
HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmldocObject = hwObject.Load("http://www...");
foreach (var script in htmldocObject.DocumentNode.Descendants("script").ToArray())
{
    string s = script.InnerText;
    // Modify s somehow
    // Note: Single() throws unless the tag has exactly one text child
    HtmlTextNode text = (HtmlTextNode)script.ChildNodes
        .Single(d => d.NodeType == HtmlNodeType.Text);
    text.Text = s;
}
htmldocObject.Save("file.htm");
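One caveat with the loop above: Single() throws an InvalidOperationException when a script tag has no text child, which is the case for external scripts such as <script src="..."></script>. A more tolerant variant might look like the sketch below. The names ScriptRewriter, RewriteScripts and the transform parameter are made up for illustration, and it assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ScriptRewriter
{
    // Hypothetical helper: applies transform to the inline text of every
    // <script> element and returns the resulting markup.
    public static string RewriteScripts(string html, Func<string, string> transform)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToArray() takes a snapshot so the tree can be modified while looping.
        foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
        {
            // FirstOrDefault instead of Single: scripts that only have a
            // src attribute carry no text node and are skipped.
            var text = script.ChildNodes.OfType<HtmlTextNode>().FirstOrDefault();
            if (text == null)
            {
                continue;
            }
            text.Text = transform(text.Text);
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

You would call it as ScriptRewriter.RewriteScripts(code, js => js.Trim()), swapping in whatever modification you need for the lambda.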
Upvotes: 9
Reputation: 2104
You need to remove the "^.*?" and ".*$" parts, as they are why everything is included. There is also no reason to use Replace when you are looking for a substring. Just use the Regex.Match method and you should be good to go.
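As a rough sketch of that idea, the following uses Regex.Matches (the multi-result counterpart of Regex.Match) so that every script block is returned, not just the first. The ScriptExtractor class and ExtractScripts method are illustrative names, and the pattern is a simplified variant of the one in the question. It assumes reasonably well-formed markup; a literal "</script>" inside a JavaScript string would cut the match short.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScriptExtractor
{
    // Hypothetical helper: returns the body of each <script> element in html.
    public static List<string> ExtractScripts(string html)
    {
        var scripts = new List<string>();

        // Singleline lets '.' match newlines; the lazy (.*?) stops at the
        // first closing tag, so each script is captured separately.
        var pattern = new Regex(
            @"<script\b[^>]*>(.*?)</script>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        foreach (Match m in pattern.Matches(html))
        {
            // Group 1 holds only the code between the tags.
            scripts.Add(m.Groups[1].Value);
        }
        return scripts;
    }
}
```

Calling ScriptExtractor.ExtractScripts(code) then gives you one string per script tag, with the surrounding HTML left out.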
Upvotes: 2
Reputation: 1214
Drop the leading and trailing .* and use the following regexp: \<script\s?.*?\>((.|\r\n)+?)\<\/script\>
Upvotes: 0