tibbe

Reputation: 9009

Can I index source code using Lucene?

I would like to index source code using Lucene. The source code has already been pre-analysed using a compiler plugin. The output of the compiler is a list of IDs that appear in the source code. Each ID includes information about

For example, given this source code module (in pseudo-code):

module MyModule
from MyOtherModule import bar
foo = ...
print bar

here's what the compiler might output when compiling MyModule:

MyModule.foo,3:1-3:3,definition
MyOtherModule.bar,4:7-4:9,use

Note how all IDs that appear in the output are fully qualified, even though they might not appear that way in the source. This is why we use a compiler: it allows a more exact code search than a purely text-based search.
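To make the record format concrete, here is a minimal sketch of parsing one line of that output into its parts. It assumes the three comma-separated columns shown above (qualified ID, `line:col-line:col` span, and `definition` or `use`) and that the module is everything before the last dot. The names `Occurrence`, `parse_position` and `parse_occurrence` are made up for illustration and are not part of Lucene or of the compiler plugin.

```python
from typing import NamedTuple, Tuple


class Occurrence(NamedTuple):
    qualified_id: str        # e.g. "MyModule.foo"
    module: str              # assumed: everything before the last dot
    name: str                # unqualified name, e.g. "foo"
    start: Tuple[int, int]   # (line, column)
    end: Tuple[int, int]     # (line, column)
    kind: str                # "definition" or "use"


def parse_position(text: str) -> Tuple[int, int]:
    """Turn "3:1" into (3, 1)."""
    line, col = text.split(":")
    return int(line), int(col)


def parse_occurrence(record: str) -> Occurrence:
    """Parse one line such as "MyModule.foo,3:1-3:3,definition"."""
    qualified_id, span, kind = record.strip().split(",")
    module, _, name = qualified_id.rpartition(".")
    start_text, end_text = span.split("-")
    return Occurrence(qualified_id, module, name,
                      parse_position(start_text), parse_position(end_text), kind)
```

Each parsed occurrence then carries everything the question wants to keep: the qualified ID, the unqualified name, the position, and whether it is a definition or a use.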

Question: Is it possible to write a custom tokenizer and analyzer that indexes the compiler output shown above in such a way that the metadata (i.e. the fully qualified ID and whether the ID was defined or used at the given location) is kept and available when scoring the documents?

To be more precise, I'd like each term to be associated with the module where it was defined (e.g. foo would have associated metadata: defining module=MyModule). I want each posting in the posting list to store whether this particular appearance of an ID was a definition or a use of that ID.

In addition, I'd like to have Lucene store the non-qualified ID as a synonym for the qualified ID. This would allow users to search for "foo" and retrieve all documents that contain the IDs "Module1.foo" and "Module2.foo".
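The synonym idea can be sketched without Lucene as a tiny in-memory inverted index: every qualified ID is indexed under both its qualified form and its unqualified name. This only illustrates the intended behaviour under the assumption that the unqualified name is the part after the last dot; `index_terms` and `build_index` are hypothetical helpers, not Lucene calls.

```python
from collections import defaultdict


def index_terms(qualified_id):
    """Terms to index for one ID: the qualified form plus its unqualified alias."""
    _, _, name = qualified_id.rpartition(".")
    if name == qualified_id:  # no module prefix, nothing to alias
        return [qualified_id]
    return [qualified_id, name]


def build_index(docs):
    """Map each term to the set of document names containing it.

    `docs` maps a document name to the qualified IDs found in it.
    """
    index = defaultdict(set)
    for doc_id, qualified_ids in docs.items():
        for qualified_id in qualified_ids:
            for term in index_terms(qualified_id):
                index[term].add(doc_id)
    return index
```

With documents `{"a.src": ["Module1.foo"], "b.src": ["Module2.foo"]}`, looking up `foo` returns both documents, while `Module1.foo` returns only the first.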

Upvotes: 1

Views: 477

Answers (1)

Mark Leighton Fisher

Reputation: 5703

It's probably easier to put the various attributes into Lucene fields, so that you can query like:

parse module:MyModule use:yes

which would return only hits on 'parse' in 'MyModule' where 'parse' was used rather than defined.
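As a rough illustration of the field-based idea, the sketch below models each occurrence of an ID as a small record of fields and filters on them, mirroring the `parse module:MyModule use:yes` query. It is plain Python, not the Lucene API. The field names `name`, `module` and `use` follow the query above; taking `module` from the qualified ID is an assumption, since the answer does not say whether it means the defining or the containing module. `to_fields` and `search` are hypothetical helpers.

```python
def to_fields(qualified_id, kind):
    """Build one field record per occurrence of an ID.

    Assumption: `module` is the part of the qualified ID before the last dot.
    """
    module, _, name = qualified_id.rpartition(".")
    return {
        "name": name,
        "module": module,
        "use": "yes" if kind == "use" else "no",
    }


def search(records, name, **field_filters):
    """Return the records whose name and every given field match exactly."""
    return [
        record for record in records
        if record["name"] == name
        and all(record.get(field) == value for field, value in field_filters.items())
    ]
```

For example, `search(records, "parse", module="MyModule", use="yes")` keeps only the uses of `parse` whose module field is `MyModule` and drops definitions and occurrences from other modules.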

Upvotes: 2
