Reputation: 9009
I would like to index source code using Lucene. The source code has already been pre-analysed using a compiler plugin. The output of the compiler is a list of IDs that appear in the source code. Each ID includes its fully qualified name, its source location, and whether it is a definition or a use.
For example, given this source code module (in pseudo-code):
module MyModule
from MyOtherModule import bar
foo = ...
print bar
here's what the compiler might output when compiling MyModule:
MyModule.foo,3:1-3:3,definition
MyOtherModule.bar,4:7-4:9,use
Note how all IDs that appear in the output are fully qualified, even though they might not appear that way in the source. This is why we use a compiler: it allows us to do more exact code search than purely text-based search.
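To make the record format concrete, here is a minimal sketch of reading such lines into plain Java objects before anything is handed to Lucene. The class, record, and method names (CompilerOutput, Occurrence, parseLine) are made up for illustration, and the sketch assumes every line has exactly three comma-separated fields and that IDs never contain commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical model of the compiler output; all names are illustrative.
final class CompilerOutput {
    enum Kind { DEFINITION, USE }

    // One occurrence of an ID: qualified name, source span, and kind.
    record Occurrence(String qualifiedId, int startLine, int startCol,
                      int endLine, int endCol, Kind kind) {}

    // Parses a single line such as "MyModule.foo,3:1-3:3,definition".
    static Occurrence parseLine(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        String[] span = parts[1].split("-");   // "3:1" and "3:3"
        String[] start = span[0].split(":");   // line, column
        String[] end = span[1].split(":");
        Kind kind = Kind.valueOf(parts[2].toUpperCase(Locale.ROOT));
        return new Occurrence(parts[0],
                Integer.parseInt(start[0]), Integer.parseInt(start[1]),
                Integer.parseInt(end[0]), Integer.parseInt(end[1]), kind);
    }

    // Parses the whole output, skipping blank lines.
    static List<Occurrence> parse(String output) {
        List<Occurrence> result = new ArrayList<>();
        for (String line : output.split("\\R")) {
            if (!line.isBlank()) {
                result.add(parseLine(line));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String output = "MyModule.foo,3:1-3:3,definition\n"
                      + "MyOtherModule.bar,4:7-4:9,use\n";
        for (Occurrence o : parse(output)) {
            System.out.println(o);
        }
    }
}
```

Each Occurrence then carries everything a custom tokenizer would need to emit for one token: the term text, its offsets, and the definition/use flag.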
Question: Is it possible to write a custom tokenizer and analyzer that indexes the compiler output shown above in such a way that the metadata (i.e. the fully qualified ID and whether the ID was defined or used at the given location) is kept and available when scoring the documents?
To be more precise, I'd like each term to be associated with the module where it was defined (e.g. foo would have the associated metadata "defining module=MyModule"). I also want each posting in the posting list to store whether that particular occurrence of an ID was a definition or a use of that ID.
In addition, I'd like to have Lucene store the non-qualified ID as a synonym for the qualified ID. This would allow users to search for "foo" and retrieve all documents that contain the IDs "Module1.foo" and "Module2.foo".
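As a rough illustration of the synonym idea, the sketch below derives the non-qualified name from a qualified ID and groups qualified IDs under it. It is plain Java rather than Lucene code, the names (UnqualifiedNames, synonymTable) are hypothetical, and it assumes "." is the only separator and that the last segment is the non-qualified name. In Lucene itself, a synonym is usually an extra token emitted at the same position as the original, i.e. with a position increment of 0.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper; assumes IDs look like "Module.name".
final class UnqualifiedNames {

    // Returns the part after the last '.', or the whole ID if there is none.
    static String unqualified(String qualifiedId) {
        int dot = qualifiedId.lastIndexOf('.');
        return dot < 0 ? qualifiedId : qualifiedId.substring(dot + 1);
    }

    // Groups qualified IDs by their non-qualified name, so that a search
    // for "foo" can be expanded to every qualified ID ending in ".foo".
    static Map<String, Set<String>> synonymTable(List<String> qualifiedIds) {
        Map<String, Set<String>> table = new TreeMap<>();
        for (String id : qualifiedIds) {
            table.computeIfAbsent(unqualified(id), k -> new TreeSet<>()).add(id);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("Module1.foo", "Module2.foo", "MyOtherModule.bar");
        System.out.println(synonymTable(ids));
    }
}
```

Whether the expansion happens at index time (indexing "foo" alongside "Module1.foo") or at query time (rewriting "foo" into the qualified IDs) is a separate design choice; the table above would serve either.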
Upvotes: 1
Views: 477
Reputation: 5703
It's probably easier to put the various attributes into Lucene fields, so that you can query like:
parse module:MyModule use:yes
which would return only hits on 'parse' in 'MyModule' where 'parse' was used rather than defined.
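A minimal sketch of that layout, in plain Java rather than the Lucene API: each occurrence becomes a small set of field values, and the query is treated as three required clauses. The field name "name" for the default field, the ID "MyModule.parse", and the class OccurrenceFields are assumptions for illustration; "module" and "use" follow the query above. The module is taken from the qualified ID here, i.e. the defining module, which may or may not be what you want to filter on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical field layout: one set of fields per occurrence of an ID.
final class OccurrenceFields {

    // Splits a qualified ID into "name" and "module" and records the kind.
    static Map<String, String> toFields(String qualifiedId, boolean isUse) {
        int dot = qualifiedId.lastIndexOf('.');
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", dot < 0 ? qualifiedId : qualifiedId.substring(dot + 1));
        fields.put("module", dot < 0 ? "" : qualifiedId.substring(0, dot));
        fields.put("use", isUse ? "yes" : "no");
        return fields;
    }

    // Emulates "<name> module:<module> use:<use>" with every clause required.
    static boolean matches(Map<String, String> fields,
                           String name, String module, String use) {
        return name.equals(fields.get("name"))
                && module.equals(fields.get("module"))
                && use.equals(fields.get("use"));
    }

    public static void main(String[] args) {
        Map<String, String> hit = toFields("MyModule.parse", true);
        System.out.println(hit + " matches: " + matches(hit, "parse", "MyModule", "yes"));
    }
}
```

In a real index each of these maps would correspond to one Lucene document (or one group of fields on a document), so the definition/use flag is filtered by an ordinary term query instead of being read out of the posting list.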
Upvotes: 2