Reputation: 10865

Get the source code from an html file

Could you please help me generate a .cpp/.h file from the following HTML file programmatically (using any scripting or programming language, or even an editor such as vi or emacs)?

<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>Class</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body link="blue" vlink="purple" bgcolor="#FFFABB" text="black">

<h2><font face="Helvetica">Code Fragment: Class</font></h2>
</center><br><dl><dd><pre>

  <font color=#A000A0>template</font> &lt;<font color=#A000A0>typename</font> G&gt;
  <font color=#A000A0>class</font> Components : <font color=#A000A0>public</font> DFS&lt;G&gt; {            <font color=#0000FF>// count components</font>
  <font color=#A000A0>private</font>:
    <font color=#A000A0>int</font> nComponents;                 <font color=#0000FF>// num of components</font>
  <font color=#A000A0>public</font>:
    <font color=#000000>Components</font>(<font color=#A000A0>const</font> G& g): DFS&lt;G&gt;(g) {}        <font color=#0000FF>// constructor</font>
    <font color=#A000A0>int</font> <font color=#A000A0>operator</font>()();                 <font color=#0000FF>// count components</font>
  };
</dl>

</body>
</html>

If you could also point out how to go in the other direction (from source code to HTML like the above), that would be great. Thanks a lot.

Upvotes: 0

Answers (6)

JRL

Reputation: 78033

PHP script:

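<?php
// The page is not well-formed, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile("file.html");
$xpath = new DOMXPath($doc);
$str = '';
// Concatenate every text node under <dl>; entities such as &lt; are already decoded.
foreach ($xpath->query("//dl//text()") as $node) {
    // The separator appended here is what produces the extra spaces in the output.
    $str .= $node->nodeValue . ' ';
}

file_put_contents('file.cpp', $str);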

contents of file.cpp:

   template  < typename  G>
   class  Components :  public  DFS<G> {             // count components 
   private :
     int  nComponents;                  // num of components 
   public :
     Components ( const  G& g): DFS<G>(g) {}         // constructor 
     int   operator ()();                  // count components 
  };

Upvotes: 2

Matteo Italia

Reputation: 126957

Another option for going from HTML to the source code is the html2text utility, which is installed by default on many Linux distributions.

matteo@teomint:~/Desktop$ html2text out.html 
***** Code Fragment: Class *****


        template <typename G>
        class Components : public DFS<G> {            // count components
        private:
          int nComponents;                 // num of components
        public:
          Components(const G& g): DFS<G>(g) {}        // constructor
          int operator()();                 // count components
        };

Upvotes: 1

jman

Reputation: 11626

Does this work for you?

[18:56:44 jaidev@~]$ lynx --dump foo.html
Code Fragment: Class


  template <typename G>
  class Components : public DFS<G> {            // count components
  private:
    int nComponents;                 // num of components
  public:
    Components(const G& g): DFS<G>(g) {}        // constructor
    int operator()();                 // count components
  };
[18:56:49 jaidev@~]$

Edit:

For the reverse direction: if you use vim as your editor, you can enter :TOhtml to generate a syntax-highlighted HTML version of your code in a new buffer. The HTML it generates is based on your vim colorscheme. To change the colorscheme, use the :colorscheme <name> command.
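
If vim is not available, the reverse direction can also be approximated with a few lines of script. The following is only a minimal sketch in Python, not what :TOhtml does: the function name source_to_html and the short keyword list are made up for illustration, and the colours are copied from the question's HTML. It escapes <, > and & and wraps keywords and // comments in <font> tags.

```python
import html
import re

# Small, incomplete keyword list chosen for illustration only (an assumption).
KEYWORDS = ("template", "typename", "class", "public", "private",
            "const", "int", "operator")
# A // comment wins over a keyword because it is tried first at each position.
TOKEN = re.compile(r"(?P<comment>//[^\n]*)|\b(?P<keyword>"
                   + "|".join(KEYWORDS) + r")\b")

def source_to_html(code: str) -> str:
    """Wrap C++ keywords and // comments in <font> tags inside a <pre> block."""
    out, pos = [], 0
    for m in TOKEN.finditer(code):
        # Plain text between tokens only needs escaping.
        out.append(html.escape(code[pos:m.start()], quote=False))
        color = "#0000FF" if m.lastgroup == "comment" else "#A000A0"
        token = html.escape(m.group(), quote=False)
        out.append(f'<font color="{color}">{token}</font>')
        pos = m.end()
    out.append(html.escape(code[pos:], quote=False))
    return "<pre>\n" + "".join(out) + "</pre>"
```

String literals and /* */ comments are not handled, so a keyword inside a string would still be coloured; a real highlighter is a better choice for anything beyond small fragments.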

Upvotes: 8

executifs

Reputation: 1178

You could use regular expressions to...

...keep only what's in the <body> of the HTML page,
...strip all the HTML tags (everything that looks like <[^>]*> should be removed from the file; a greedy <.*> would swallow the text between two tags on the same line),
...unescape special characters such as &lt;, &gt;, &amp; etc.

What's left should be the code you're looking for.
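
The three steps above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the function name html_to_source is made up, and the tag pattern is a simple one that is good enough for a page like the one in the question but is not a general HTML parser.

```python
import html
import re

def html_to_source(page: str) -> str:
    """Strip tags from the <body> of an HTML page and unescape entities."""
    # 1. Keep only what is inside <body>...</body> (fall back to the whole page).
    match = re.search(r"<body[^>]*>(.*?)</body>", page,
                      re.DOTALL | re.IGNORECASE)
    body = match.group(1) if match else page
    # 2. Remove the tags. [^>]* keeps one match from spanning several tags.
    text = re.sub(r"<[^>]*>", "", body)
    # 3. Only now turn &lt; &gt; &amp; back into < > &.
    return html.unescape(text)
```

The order matters: if the entities were unescaped first, <typename G> would look like a tag and be stripped in step 2. Note that any other text in the body, such as the heading, is kept as well and has to be removed separately.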

Upvotes: 1

kestrel

Reputation: 1344

If you're trying to strip all HTML tags to get back the original, non-highlighted source code, then you have two options that I can think of:

Parse the DOM tree and just grab all relevant text.
Use some regular expressions to remove the tags themselves. For example, maybe "s///" would be a good start?
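
As a rough illustration of the first option, here is a minimal sketch that uses Python's standard html.parser module and keeps only the text found inside <pre>. The class and function names (PreTextExtractor, extract_pre_text) are hypothetical, and restricting the extraction to <pre> is an assumption based on the HTML in the question.

```python
from html.parser import HTMLParser

class PreTextExtractor(HTMLParser):
    """Collect the text found inside <pre> elements (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True means &lt; and friends arrive already decoded.
        super().__init__(convert_charrefs=True)
        self.depth = 0     # number of <pre> elements currently open
        self.chunks = []   # text pieces seen while inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_pre_text(page: str) -> str:
    parser = PreTextExtractor()
    parser.feed(page)
    parser.close()   # flush any text still buffered by the parser
    return "".join(parser.chunks)
```

Because the page in the question never closes its <pre>, this sketch would simply keep collecting until the end of the file, which in that case only adds trailing whitespace.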

Upvotes: 0

Lightness Races in Orbit

Reputation: 385405

Fix the HTML. You're missing some closing tags.
Get PHP out
- Obtain the pre code block with DOMDocument
- strip_tags() from the result
Profit.

Upvotes: 0
