Can I index source code using Lucene?

Question

I would like to index source code using Lucene. The source code has already been pre-analysed using a compiler plugin. The output of the compiler is a list of IDs that appear in the source code. Each ID includes information about

the module the ID was defined in (as opposed to used in),
the source span where the ID appears (i.e. line:col-line:col), and
whether the ID is defined at this location or merely used here.

For example, given this source code module (in pseudo-code)

module MyModule
from MyOtherModule import bar
foo = ...
print bar

here's what the compiler might output when compiling MyModule:

MyModule.foo,3:1-3:3,definition
MyOtherModule.bar,4:7-4:9,use
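
To make the record format concrete, here is a minimal Python sketch that parses lines like the ones above. It is an illustration rather than the actual compiler plugin's format specification: the names Occurrence and parse_line are hypothetical, and it assumes that fields never contain commas and that the defining module is everything before the last dot of the qualified ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    module: str          # module the ID was defined in
    name: str            # unqualified ID
    start: tuple         # (line, col) where the ID begins
    end: tuple           # (line, col) where the ID ends
    is_definition: bool  # True for "definition", False for "use"


def parse_line(line: str) -> Occurrence:
    """Parse one 'Module.id,line:col-line:col,kind' record (assumed format)."""
    qualified, span, kind = line.strip().split(",")
    if kind not in ("definition", "use"):
        raise ValueError(f"unknown occurrence kind: {kind!r}")
    # Assumption: the module is everything before the last dot.
    module, _, name = qualified.rpartition(".")
    start_text, end_text = span.split("-")
    start = tuple(int(part) for part in start_text.split(":"))
    end = tuple(int(part) for part in end_text.split(":"))
    return Occurrence(module, name, start, end, kind == "definition")


COMPILER_OUTPUT = """\
MyModule.foo,3:1-3:3,definition
MyOtherModule.bar,4:7-4:9,use
"""

records = [parse_line(line) for line in COMPILER_OUTPUT.splitlines()]
```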

Note how all IDs that appear in the output are fully qualified, even though they might not appear that way in the source. This is why we use a compiler: it allows a more exact code search than a purely text-based one.

Question: Is it possible to write a custom tokenizer and analyzer that indexes the compiler output shown above in such a way that the metadata (i.e. the fully qualified ID and whether the ID was defined or used at the given location) is kept and available when scoring the documents?

To be more precise, I'd like each term to be associated with the module where it was defined (e.g. foo would have associated metadata: defining module=MyModule). I want each posting in the posting list to store whether this particular appearance of an ID was a definition or a use of that ID.
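
To illustrate the data model I have in mind, here is a toy in-memory index in plain Python. It is not Lucene code and says nothing about how Lucene would store this; ToyIndex, its methods, the definition_boost weight, and the extra MyOtherModule document are all hypothetical, chosen only to show per-term metadata, per-posting flags, and a score that depends on them.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index: per-term metadata plus a flag on every posting."""

    def __init__(self):
        # term -> defining module (metadata attached to the term itself)
        self.term_module = {}
        # term -> list of (doc_id, span, is_definition) postings
        self.postings = defaultdict(list)

    def add(self, doc_id, qualified_id, span, is_definition):
        # Assumption: the module is everything before the last dot.
        self.term_module[qualified_id] = qualified_id.rpartition(".")[0]
        self.postings[qualified_id].append((doc_id, span, is_definition))

    def search(self, qualified_id, definition_boost=2.0):
        """Rank documents, weighting definitions above uses (hypothetical scoring)."""
        scores = defaultdict(float)
        for doc_id, _span, is_definition in self.postings.get(qualified_id, []):
            scores[doc_id] += definition_boost if is_definition else 1.0
        # Highest score first; ties broken by document id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = ToyIndex()
index.add("MyModule", "MyModule.foo", "3:1-3:3", True)
index.add("MyModule", "MyOtherModule.bar", "4:7-4:9", False)
# Hypothetical second document that defines bar.
index.add("MyOtherModule", "MyOtherModule.bar", "2:1-2:3", True)

results = index.search("MyOtherModule.bar")
```

With these weights, the document that defines bar ranks above the one that only uses it.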

In addition, I'd like to have Lucene store the non-qualified ID as a synonym for the qualified ID. This would allow users to search for "foo" and retrieve all documents that contain the IDs "Module1.foo" and "Module2.foo".
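
The synonym behaviour I'm after can be sketched in plain Python by indexing every occurrence under both its qualified and its unqualified name. Again, this is only an illustration of the desired result, not Lucene code; index_terms, build_index, and the three sample documents are hypothetical.

```python
from collections import defaultdict


def index_terms(qualified_id):
    """Return the terms to index for one ID: the qualified name and its synonym."""
    unqualified = qualified_id.rpartition(".")[2]
    if unqualified == qualified_id:
        return [qualified_id]  # no module prefix, nothing to add
    return [qualified_id, unqualified]


def build_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, ids in docs.items():
        for qualified_id in ids:
            for term in index_terms(qualified_id):
                index[term].add(doc_id)
    return index


docs = {
    "doc1": ["Module1.foo"],
    "doc2": ["Module2.foo"],
    "doc3": ["Module1.bar"],
}
index = build_index(docs)

foo_hits = sorted(index["foo"])
exact_hits = sorted(index["Module1.foo"])
```

Here a search for "foo" matches doc1 and doc2, while a search for "Module1.foo" matches only doc1.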
