Gerald

Reputation: 23499

Options for parsing/processing C++ files

I need to parse some relatively simple annotated C++ files and generate additional source files from them.

As an example, I may have something like this:

//@ service
struct MyService
{
   int getVal() const;
};

I will need to find the //@ service annotation, and get a description of the structure that follows it.

I am looking at possibly leveraging LLVM/Clang since it seems to have library support for embedding compiler/parsing functionality in third-party applications. But I'm really pretty clueless as far as parsing source code goes, so I'm not sure what exactly I would need to look for, or where to start.

I understand that ASTs are at the core of language representations, and that Clang has library support for generating an AST from source files. But comments would not really be part of an AST, right? So what would be a good way of finding the representation of a structure that follows a specific comment annotation?

I'm not too worried about handling cases where the annotation would appear in an inappropriate place as it will only be used to parse C++ files that are specifically written for this application. But of course the more robust I can make it, the better.

Upvotes: 4

Views: 498

Answers (3)

Tom

Reputation: 2389

I did some very similar work recently. The research I did indicated that there weren't any out-of-the-box solutions available, so I ended up hand-rolling one.

The other answers are dead-on regarding parsing C++ code. I needed something that could get ~90% of C++ code parsed correctly; I ended up using srcML. This tool takes C++ or Java source code and converts it to an XML document, which is easier to process. It keeps the comments intact. Furthermore, if you need to do a source code transformation, it comes with a reverse tool that takes the XML document and produces source code.

It works in 90% of the cases correctly, but it trips on complicated template metaprogramming and the darkest corners of C++ parsing. Fortunately, my input source code is fairly consistent in design (not a lot of C++ trickery), so it works for us.

Other items to look at include GCC-XML and Reflex (which actually uses GCC-XML). I'm not sure whether GCC-XML preserves comments, but it does preserve GCC attributes and pragmas.

One last item to look at is this blog on writing GCC plugins, written by the author of the CodeSynthesis ODB tool.

Good luck!

Upvotes: 1

Maxim Egorushkin

Reputation: 136286

One way I've been doing this is annotating identifiers of:

  • classes
  • base classes
  • class members
  • enumerations
  • enumerators

E.g.:

class /* @ann-class */ MyClass 
    : /* @ann-base-class */ MyBaseClass
{
    int /* @ann-member */ member_;
};

Such annotations make it easy to write a Python or Perl script that reads the header line by line and extracts each annotation and its associated identifier.
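
The line-by-line extraction can be sketched in a few lines. The answer suggests Python or Perl; the sketch below uses C++ with std::regex only to stay in the language of the thread. The names Annotation and extract_annotations are made up for illustration, and the pattern assumes each annotation has the exact form /* @tag */ immediately followed by the identifier.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// One extracted pair: the annotation tag and the identifier that follows it.
// (Hypothetical type, not part of the original answer.)
struct Annotation {
    std::string tag;         // e.g. "ann-class"
    std::string identifier;  // e.g. "MyClass"
};

// Reads the header text line by line and collects every
// "/* @tag */ identifier" occurrence, in order of appearance.
std::vector<Annotation> extract_annotations(const std::string& header) {
    static const std::regex pattern(
        R"(/\*\s*@([A-Za-z0-9_-]+)\s*\*/\s*(\w+))");
    std::vector<Annotation> result;
    std::istringstream lines(header);
    std::string line;
    while (std::getline(lines, line)) {
        std::sregex_iterator it(line.begin(), line.end(), pattern);
        const std::sregex_iterator end;
        for (; it != end; ++it) {
            result.push_back({(*it)[1].str(), (*it)[2].str()});
        }
    }
    return result;
}
```

Because it only matches text, this will not cope with annotations split across lines or with identifiers hidden behind macros; it relies on the headers being written for the tool.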

The annotation and the associated identifier make it possible to generate C++ reflection in the form of function templates that traverse objects, passing base classes and members to a functor, e.g.:

template<class Functor>
void reflect(MyClass& obj, Functor f) {
    f.on_object_start(obj);
    f.on_base_subobject(static_cast<MyBaseClass&>(obj));
    f.on_member(obj.member_);
    f.on_object_end(obj);
}

It is also handy to generate numeric ids (enumerations) for each base class and member and pass them to the functor, e.g.:

    f.on_base_subobject(static_cast<MyBaseClass&>(obj), BaseClassIndex<MyClass>::MyBaseClass);
    f.on_member(obj.member_, MemberIndex<MyClass>::member_);
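
The answer shows only how the ids are used, not how they are declared. One possible shape for the generated declarations, assuming one explicit specialization per annotated class (the layout and the numbering from 0 are assumptions):

```cpp
class MyClass;  // the annotated class from the example above

// Primary templates, left undefined; the generator would emit one
// specialization per annotated class. (Assumed layout.)
template<class T> struct BaseClassIndex;
template<class T> struct MemberIndex;

// Generated for MyClass: one enumerator per base class...
template<> struct BaseClassIndex<MyClass> {
    enum { MyBaseClass = 0 };
};

// ...and one enumerator per annotated member.
template<> struct MemberIndex<MyClass> {
    enum { member_ = 0 };
};
```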

Such reflection code lets you write functors that serialize and de-serialize any object type to/from a number of different formats. The functors use function overloading and/or type deduction to treat different types appropriately.
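
As a rough sketch of such a functor, here is a text writer that picks its formatting through overloads of on_member. Everything except the shape of reflect is an assumption made for illustration: the members are public, MyBaseClass has a name_ member, and TextWriter and to_text are invented names.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Simplified stand-ins for the annotated classes (members made public).
struct MyBaseClass { std::string name_ = "base"; };
struct MyClass : MyBaseClass { int member_ = 2; };

// What the generator would emit for each class.
template<class Functor>
void reflect(MyBaseClass& obj, Functor f) {
    f.on_object_start(obj);
    f.on_member(obj.name_);
    f.on_object_end(obj);
}

template<class Functor>
void reflect(MyClass& obj, Functor f) {
    f.on_object_start(obj);
    f.on_base_subobject(static_cast<MyBaseClass&>(obj));
    f.on_member(obj.member_);
    f.on_object_end(obj);
}

// A functor that writes a simple text form of any reflected object.
struct TextWriter {
    std::ostream& out;

    template<class T> void on_object_start(T&) { out << "{ "; }
    template<class T> void on_object_end(T&) { out << "} "; }

    // Base subobjects are traversed recursively with the same functor.
    template<class Base> void on_base_subobject(Base& base) {
        reflect(base, *this);
    }

    // Overloads choose the formatting per member type.
    void on_member(int value) { out << value << ' '; }
    void on_member(const std::string& value) {
        out << '"' << value << "\" ";
    }
};

// Convenience wrapper: serialize an object into a string.
std::string to_text(MyClass& obj) {
    std::ostringstream out;
    reflect(obj, TextWriter{out});
    return out.str();
}
```

The functor is passed by value, as in the reflect signature above, so it holds a reference to the stream rather than owning any state. A reader functor for de-serialization would take the same shape, with on_member taking its argument by non-const reference.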

Upvotes: 4

jfs

Reputation: 414305

Parsing C++ code is an extremely complex task. Leveraging a C++ compiler might help, but it could be better to restrict yourself to a less powerful, domain-specific format: generate both the source and the additional C++ files from a simpler representation, something like protobuf's .proto files, SOAP's WSDL, or something even simpler for your specific case.
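
To make the idea concrete, here is a minimal sketch that goes in the opposite direction from parsing: the service is described as plain data and the C++ declaration is generated from it. ServiceSpec, MethodSpec and generate_struct are hypothetical names, and the description is assumed to cover only const methods without parameters, as in the question's example.

```cpp
#include <string>
#include <vector>

// A deliberately tiny service description (hypothetical format).
struct MethodSpec {
    std::string return_type;
    std::string name;
};

struct ServiceSpec {
    std::string name;
    std::vector<MethodSpec> methods;
};

// Emits the C++ struct declaration for a service description.
std::string generate_struct(const ServiceSpec& spec) {
    std::string out = "struct " + spec.name + "\n{\n";
    for (const MethodSpec& method : spec.methods) {
        out += "   " + method.return_type + " " + method.name + "() const;\n";
    }
    out += "};\n";
    return out;
}
```

The same ServiceSpec could then feed whatever additional source files are needed, without any C++ parsing step.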

Upvotes: 2
