Jerodev

Reputation: 33186

Get JavaScript code from an HTML file

I am trying to extract the JavaScript code from an HTML file using C# and regular expressions. This is the code I use now:

string js = Regex.Replace(code, @"^.*?\<script\s?.*?\>((.|\r\n)+?)\<\/script\>.*$", "$1", RegexOptions.Multiline);

But when I use this, I get the full HTML code back with only the script tags stripped.

Can someone help me with this?


I now use the Html Agility Pack with the following code:

var hwObject = new HtmlWeb();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(code);
foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
{
    string js = script.InnerText;
    HtmlTextNode text = (HtmlTextNode)script.ChildNodes.Single(d => d.NodeType == HtmlNodeType.Text);
    text.Text = TrimJs(js);
}

But only the last script tag gets replaced. The scripts before it just disappear.

Upvotes: 3

Views: 2986

Answers (3)

Ryan Gross

Reputation: 6515

You should take a look at Html Agility Pack.

It is generally much easier to parse HTML with a dedicated HTML parser than with regular expressions.

You could use something like this:

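using System.Linq;
using HtmlAgilityPack;

HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmldocObject = hwObject.Load("http://www...");
// ToArray() takes a snapshot so the nodes can be modified while looping
foreach (var script in htmldocObject.DocumentNode.Descendants("script").ToArray())
{
    string s = script.InnerText;
    // Modify s somehow
    // Single() throws unless the script element has exactly one text child
    HtmlTextNode text = (HtmlTextNode)script.ChildNodes
                        .Single(d => d.NodeType == HtmlNodeType.Text);
    text.Text = s;
}
htmldocObject.Save("file.htm");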

Upvotes: 9

Johny Skovdal

Reputation: 2104

You need to remove the "^.*?" and ".*$", since they are why everything is included. There is also no reason to use Replace when you are looking for a substring. Just use the Regex.Match method and you should be good to go.
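
To illustrate the Match-based approach, here is a minimal sketch. It uses Regex.Matches rather than Regex.Match so that every script block is returned, not just the first. Note that Regex.Replace leaves all text outside a match untouched, which fits the "full HTML with the tags stripped" symptom. The pattern, the ScriptExtractor class, and the ExtractScripts method are my own assumptions for the example and are not taken from the question. Like any regex approach, it can be fooled by a "</script>" inside a JavaScript string literal.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ScriptExtractor
{
    // Assumed pattern (not the one from the question):
    //   <script\b[^>]*>  the opening tag with any attributes
    //   (.*?)            the script body, captured lazily as group 1
    //   </script>        the closing tag
    // Singleline lets '.' match newlines; IgnoreCase accepts <SCRIPT> too.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>(.*?)</script>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the body of every script element found in the given HTML.
    public static List<string> ExtractScripts(string html)
    {
        var scripts = new List<string>();
        foreach (Match m in ScriptBlock.Matches(html))
        {
            scripts.Add(m.Groups[1].Value);
        }
        return scripts;
    }

    public static void Main()
    {
        string code =
            "<html><head><script type=\"text/javascript\">var a = 1;</script></head>" +
            "<body><p>Hi</p><script>\nalert(a);\n</script></body></html>";

        foreach (string js in ExtractScripts(code))
        {
            Console.WriteLine(js.Trim());
        }
    }
}
```

Script tags that only reference an external file (a src attribute and an empty body) come back as empty strings, so you may want to filter those out.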

Upvotes: 2

Thaddee Tyl

Reputation: 1214

Drop the leading ^.*? and the trailing .*$ (use the following regexp: \<script\s?.*?\>((.|\r\n)+?)\<\/script\>)

Upvotes: 0
