How to Write a Source to Source Compiler API

I am doing a little research on source-to-source compilation, and now that I am starting to understand it, I am wondering whether there are any examples of APIs for source-to-source compilers.

I mean an interface description for passing the source code of one programming language to another compiler to be compiled. If so, can you point me to such examples, or give me tips (just a plain explanation) on writing one? I am still in the research phase.

I should note that I have been researching this for several days and have come across things such as ROSE, DMS and LLVM. As I said, this is purely research, so I don't know what the best approach is. I know I wouldn't use ROSE, since it is only for C/C++. LLVM seems promising, but I am new to it. My aim is to create a transpiler that supports 4 languages (is that feasible?), which is why I need expert advice :)

Upvotes: 1

Answers (2)

SujanKh

Reputation: 41

I have been using the ROSE compiler framework to write a source-to-source translator. ROSE can parse a language that it supports and create an AST from it. It provides different APIs (found in SageInterface) to perform transformations and analyses on the AST. After the transformation, you can unparse the transformed AST to produce your target source code.

If ROSE does not support parsing your input language, you can write your own parser and use ROSE's SageBuilder API to build the AST. If your target language is one of the languages that ROSE supports, you can rely on ROSE's unparser to get the target code. If ROSE does not support your target language, you can write your own unparser as well, using the AST traversal mechanisms provided by ROSE.

Upvotes: 1

Ira Baxter

Reputation: 95400

Yes, you can have a procedural API for doing source-to-source translation. These are pretty straightforward in the abstract: define a core data structure to represent AST nodes, then define APIs to "parse file to AST", "visit tree nodes", "inspect tree nodes", "modify tree nodes", "spit out text". They get messy in the concrete, especially if the API is specific to the language being translated; too many of the details of that language get wound into the APIs. While traditional, this is really a rather clumsy way to define source-to-source translators, because you then have to write tons of procedural code invoking the APIs to do the translation.
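
To make the five operations concrete, here is a minimal sketch in Python. It is not the API of ROSE, DMS or any other real tool; the toy prefix input syntax and the names `Node`, `parse`, `visit`, `replace_where` and `unparse` are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class Node:
    kind: str                                  # e.g. "add", "mul", "name", "num"
    text: str = ""                             # payload for leaf nodes
    children: List["Node"] = field(default_factory=list)


def parse(source: str) -> Node:
    """'Parse file to AST' for a toy prefix syntax such as (add x (mul y 1))."""
    tokens = source.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = Node(kind=tokens[pos])      # operator name follows "("
            pos += 1
            while tokens[pos] != ")":
                node.children.append(read())
            pos += 1                           # skip the closing ")"
            return node
        return Node("num" if tok.isdigit() else "name", tok)

    return read()


def visit(node: Node) -> Iterator[Node]:
    """'Visit tree nodes': pre-order walk over the whole tree."""
    yield node
    for child in node.children:
        yield from visit(child)


def replace_where(node: Node,
                  pred: Callable[[Node], bool],
                  build: Callable[[Node], Node]) -> Node:
    """'Inspect' and 'modify': rebuild bottom-up, replacing nodes where pred holds."""
    rebuilt = Node(node.kind, node.text,
                   [replace_where(c, pred, build) for c in node.children])
    return build(rebuilt) if pred(rebuilt) else rebuilt


def unparse(node: Node) -> str:
    """'Spit out text': emit the tree in infix notation."""
    ops = {"add": "+", "mul": "*"}
    if not node.children:
        return node.text
    return "(" + f" {ops[node.kind]} ".join(unparse(c) for c in node.children) + ")"


tree = parse("(add x (mul y 1))")
names = [n.text for n in visit(tree) if n.kind == "name"]
simplified = replace_where(
    tree,
    lambda n: n.kind == "mul" and n.children[1].text == "1",   # matches e * 1
    lambda n: n.children[0],                                   # replace by e
)
result = unparse(simplified)
```

Even in this toy, the single simplification "e * 1 becomes e" takes two lambdas and knowledge of child positions, which gives a feel for how much procedural code a full translator written this way would need.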

You can instead define them using a program transformation system (PTS), with source-to-source transformations based on surface syntax. These are patterns written using the notation of your to-be-compiled language and your target language, in the form of "if you see this, then replace it by that", operating on syntax trees, not text strings. This means you can inspect the transforms simply by staring at them. So can your fellow programmer.

One such translation rule might look like:

 rule translate_add_to(t: access_path, u: access_path):COBOL -> Java
      " add \t to \u "
 ->   "  \object_for\(\u\).\u += \object_for\(\t\).\t; ";

with a left-hand side "add \t to \u " specifying a COBOL fragment (this) to be replaced by the right-hand side " \object_for... " representing the corresponding Java code (that). This rule uses a helper function "object_for" to decide where in the target Java program a global variable from the source COBOL program will be placed. (There's no avoiding writing such a function if you are translating COBOL to Java. You can argue about how sophisticated it needs to be.) In practice, the way such a rule works is that the pattern ASTs of each side are constructed, and then the left-hand pattern is matched against a parsed AST; a match causes the corresponding right-hand subtree to be spliced into place where the match was found. (All this low-level tree matching and splicing has to be done... procedurally, but somebody else has already implemented that in a PTS.)
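
The match-and-splice step can be sketched in a few lines of Python. This is only an illustration of the idea, not how DMS or any real PTS implements it. The assumptions: trees are nested tuples, metavariables are spelled `?t` and `?u` (standing in for \t and \u above), the node kinds such as `cobol_add` are invented, and `object_for` is left in the output as an unresolved placeholder node instead of being evaluated.

```python
from typing import Optional

# A leaf is a str; an interior node is a tuple (kind, child, child, ...).


def is_var(t) -> bool:
    """Metavariables are leaves spelled with a leading '?'."""
    return isinstance(t, str) and t.startswith("?")


def match(pattern, tree, binds: dict) -> Optional[dict]:
    """Return bindings extended so that pattern equals tree, or None on mismatch."""
    if is_var(pattern):
        if pattern in binds:                       # repeated variable must agree
            return binds if binds[pattern] == tree else None
        return {**binds, pattern: tree}
    if isinstance(pattern, str) or isinstance(tree, str):
        return binds if pattern == tree else None
    if len(pattern) != len(tree):
        return None
    for p, t in zip(pattern, tree):
        binds = match(p, t, binds)
        if binds is None:
            return None
    return binds


def instantiate(template, binds: dict):
    """Build the replacement subtree by filling in the bound metavariables."""
    if is_var(template):
        return binds[template]
    if isinstance(template, str):
        return template
    return tuple(instantiate(t, binds) for t in template)


def rewrite(tree, lhs, rhs):
    """Apply one rule everywhere, bottom-up, splicing rhs where lhs matches."""
    if not isinstance(tree, str):
        tree = tuple(rewrite(c, lhs, rhs) for c in tree)
    binds = match(lhs, tree, {})
    return instantiate(rhs, binds) if binds is not None else tree


# Tree form of: add ?t to ?u  ->  object_for(?u).?u += object_for(?t).?t;
LHS = ("cobol_add", "?t", "?u")
RHS = ("java_plus_assign",
       ("field", ("object_for", "?u"), "?u"),
       ("field", ("object_for", "?t"), "?t"))

program = ("block", ("cobol_add", "X", "TOTAL"), ("display", "TOTAL"))
translated = rewrite(program, LHS, RHS)
```

In a real PTS the two pattern trees would be produced by parsing the quoted COBOL and Java fragments with the actual grammars; here they are written out by hand.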

In our experience, you need one to two thousand such rules to translate one language to another. The plethora of rules comes from the combinatorics of language syntax constructs for the source language (and their perhaps different interpretations according to types; "a+b" means different things when a is an int vs when a is a string) and the target language opportunities. A nice plus of such rewrites is that one can build a somewhat simpler base translation, and apply additional rewrites from the target language to itself to clean up and optimize the translated result.
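
As a small illustration of both points (type-dependent rules, and cleanup rewrites from the target language to itself), consider the following Python sketch. Everything in it is an assumption for the example: the node kinds `plus`, `add` and `concat`, the type table, the rule that a compound expression takes the type of its left operand, and the Java-like output text.

```python
def type_of(expr, types: dict) -> str:
    """Toy typing: a leaf is looked up; a compound takes its left operand's type."""
    return types[expr] if isinstance(expr, str) else type_of(expr[1], types)


def translate(expr, types: dict):
    """Base translation of source 'plus' nodes; the rule chosen depends on type."""
    if isinstance(expr, str):
        return expr
    _, left, right = expr                      # this toy only has ("plus", a, b)
    new_left, new_right = translate(left, types), translate(right, types)
    if type_of(left, types) == "string":
        return ("concat", new_left, new_right) # string rule
    return ("add", new_left, new_right)        # numeric rule


def cleanup(expr):
    """Target-to-target rewrite applied after translation: e + 0 becomes e."""
    if isinstance(expr, str):
        return expr
    kind, left, right = expr[0], cleanup(expr[1]), cleanup(expr[2])
    if kind == "add" and right == "0":
        return left
    return (kind, left, right)


def emit(expr) -> str:
    """Print the target tree as Java-like text."""
    if isinstance(expr, str):
        return expr
    kind, left, right = expr
    if kind == "concat":
        return f"{emit(left)}.concat({emit(right)})"
    return f"({emit(left)} + {emit(right)})"


TYPES = {"n": "int", "0": "int", "s": "string", "t": "string"}

numeric = emit(cleanup(translate(("plus", ("plus", "n", "0"), "n"), TYPES)))
textual = emit(cleanup(translate(("plus", "s", "t"), TYPES)))
```

The base translation stays naive (it happily produces `n + 0`), and the separate cleanup pass tidies the result, which is the division of labor described above.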

Many PTS are purely based on source-to-source surface syntax rewrites. We have found that combining a PTS with a procedural API, and making it possible to segue between them, makes for a very nice tool: you can use the rewrites where convenient, and procedural APIs where they don't work so well (the "object_for" function suggested above is easier to code as a procedure).
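
A hedged sketch of what the procedural side might look like for "object_for", in Python. The placement table, the object names `ledger`, `input` and `globals`, and the function `translate_add_to` are hypothetical; a real implementation would compute the placement from an analysis of the COBOL program.

```python
# Hypothetical placement decisions: which Java object holds each former
# COBOL global. A real tool would derive this table from program analysis.
PLACEMENT = {"TOTAL": "ledger", "X": "input"}


def object_for(variable: str) -> str:
    """Procedural helper: name of the Java object that owns the variable."""
    return PLACEMENT.get(variable, "globals")  # fallback holder is an assumption


def translate_add_to(t: str, u: str) -> str:
    """The declarative right-hand side, calling out to the procedural helper."""
    return f"{object_for(u)}.{u} += {object_for(t)}.{t};"


java_statement = translate_add_to("X", "TOTAL")
```

The shape of the output stays readable as a template, while the lookup logic, which may grow arbitrarily complicated, lives in ordinary code.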

See a lot more detail on how our DMS Software Reengineering Toolkit encodes such transformation rules (the one above is coded in DMS style), in a language-agnostic (well, parameterized) fashion. DMS offers a "pure" procedural API, as OP requested, with some 400 functions, but DMS encourages its users to lean heavily on the rewrites and only code as little as necessary against the procedural API. It would be "straightforward" (at least as straightforward as practical) to build your "4 language support" this way.

Don't underestimate the amount of effort needed to build such translators, even with a lot of good technical machinery as a foundation. Languages tend to be complex beasts, and their translations doubly so. And you have to decide if you want a truly crummy translation or a good one.

Upvotes: 4
