Roslyn: enumerate exact token + trivia spans on a single source line?

Question

I am looking to efficiently implement the following method:

IEnumerable<ColoredSpan> GetSyntaxHighlightedSpansOnLine(int lineNumber);

I have a Document, SourceText, SyntaxTree et al. Assume ColoredSpan is a tuple of some color and string (or other source of chars). For the third line of this code for example:

namespace Foo
{ /* Badly formatted coment...
    which continues here... */ class Bar : public IBaz // TODO: rename classes
    {
        ...

I am looking to deliver enumerable results with text:

"    ", "which continues here... */", " ", "class", " ", "Bar", " ",
":", " ", "public", " ", "IBaz", " ", "// TODO: rename classes", "\n"

Note the inclusion of whitespace and comment trivia, and the partial multiline comment.

Another answer points to means of deriving a CSharpSyntaxWalker to walk an entire portion of the AST, but not to efficiently limit traversal to a single line's nodes. On a per line basis this is not efficient and I couldn't readily work out which subsections of e.g. Roslyn "trivia" (e.g. multiline comments) to return. It also returns overlapping nodes (namespaces for example).

I have tried code as in this answer, a la:

var lineSpan = sf.GetText().Lines[lineNumber].Span;
var nodes = syntaxTree.GetRoot()
                      .DescendantNodes()
                      .Where(x => x.Span.IntersectsWith(lineSpan))

but this returns the entire AST subtree, preorder traversal, which again is inefficient, and also returns overlapping nodes (namespaces for example) and doesn't handle trivia. Other samples work with entire documents/scripts. I also consulted the API documentation which is next to zero.

Does the code analysis API efficiently permit this? Or to implement the method, do I need to traverse the entire AST ahead of time and store a subjectively bulky parallel memory-consuming data structure of my own devising like this answer?

El Zorko · Accepted Answer

Whilst you may be able to reconstruct this data from the AST, a better API for this appears to be available in the form of Microsoft.CodeAnalysis.Classification.Classifier. It looks expensive, however:

For synchronous results you need a Roslyn SemanticModel for the source code you are highlighting, which you can fetch from a Document by calling GetSemanticModelAsync(), or from a Compilation by calling GetSemanticModel() with the SyntaxTree. You can fetch and cache this at the same time that you fetch the SyntaxTree and the SourceText, i.e. as soon as you have the document. You also need a Workspace. Given these, you can call Classifier.GetClassifiedSpans() on demand.

If you can't readily obtain a current SemanticModel, you can instead call Classifier.GetClassifiedSpansAsync(), which will build a miniature model of a particular TextSpan for you.
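As a rough sketch of the asynchronous route, assuming you hold a Document and can await: the LineClassifier class and ClassifyLineAsync method are hypothetical names for illustration, not part of Roslyn, and the snippet needs the Microsoft.CodeAnalysis.Workspaces package.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.Classification;
using Microsoft.CodeAnalysis.Text;

static class LineClassifier
{
    // Hypothetical helper: classifies one zero-based line of a document.
    public static async Task<IEnumerable<ClassifiedSpan>> ClassifyLineAsync(
        Document document, int lineNumber, CancellationToken cancellationToken = default)
    {
        SourceText text = await document.GetTextAsync(cancellationToken).ConfigureAwait(false);

        // Span excludes the line break; use SpanIncludingLineBreak if you want it.
        TextSpan lineSpan = text.Lines[lineNumber].Span;

        // Ask the classifier only about this line's span.
        return await Classifier.GetClassifiedSpansAsync(document, lineSpan, cancellationToken)
                               .ConfigureAwait(false);
    }
}
```

The returned spans are consumed in the same way as those from the synchronous call.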

Either variant provides you with nearly the enumerable you ask for, but not quite.

Firstly, it returns weakly typed classification (class name, keyword, operator, etc.) for each span in the form of a string "enum"; these appear to correspond to const members of the ClassificationTypeNames class, so presumably they are reliable. You can trivially map ClassificationTypeNames.ClassName et al to colors.

Secondly, since this call returns only classified spans there will be missing unclassified spans for, for example, whitespace. You will have to reconstruct the full set of spans including such trivia, which is straightforward if tedious:

// Needs: using System; using System.Collections.Generic; using System.Linq;
//        using Microsoft.CodeAnalysis.Classification; using Microsoft.CodeAnalysis.Text;
IEnumerable<ColoredSpan> DescribeLine(int lineNumber)
{
    var lineSpan = sourceText.Lines[lineNumber].Span;
    var classifiedSpans = Classifier.GetClassifiedSpans(semanticModel, lineSpan, workspace);
    var cursor = lineSpan.Start;

    // Presuming you need a string rather than a TextSpan.
    Func<TextSpan, string> textOf = x => sourceText.ToString(x);

    // Nothing classified: the whole line gets the default style, and we are done.
    if (!classifiedSpans.Any())
    {
        yield return new ColoredSpan(defaultStyle, textOf(lineSpan));
        yield break;
    }

    foreach (var overlap in classifiedSpans)
    {
        // Clip spans that extend past the line, e.g. multi-line comments.
        var classified = overlap.TextSpan.Intersection(lineSpan).Value;

        // Emit the unclassified gap (typically whitespace) before this span.
        if (classified.Start > cursor)
        {
            var unclassified = new TextSpan(cursor, classified.Start - cursor);
            cursor = classified.Start;
            yield return new ColoredSpan(defaultStyle, textOf(unclassified));
        }

        var style = StyleFromClassificationType(overlap.ClassificationType);

        yield return new ColoredSpan(style, textOf(classified));

        cursor = classified.End;
    }

    // Emit any unclassified text remaining after the last classified span.
    if (cursor < lineSpan.End)
    {
        var trailing = new TextSpan(cursor, lineSpan.End - cursor);
        yield return new ColoredSpan(defaultStyle, textOf(trailing));
    }
}

This code presumes the existence of ColoredSpan (as in your question) and a StyleFromClassificationType() helper which maps ClassificationTypeNames to colors.
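For illustration only, here is one possible shape for those two pieces. The ColoredSpan class, the Styles class, and the use of ConsoleColor as the "color" are all assumptions made for this sketch; only the ClassificationTypeNames constants come from Roslyn, and the set of cases is far from exhaustive.

```csharp
using System;
using Microsoft.CodeAnalysis.Classification;

// Hypothetical ColoredSpan: a color paired with the text it applies to.
public sealed class ColoredSpan
{
    public ColoredSpan(ConsoleColor color, string text)
    {
        Color = color;
        Text = text;
    }

    public ConsoleColor Color { get; }
    public string Text { get; }
}

public static class Styles
{
    // Used for whitespace and anything the classifier did not label.
    public const ConsoleColor DefaultStyle = ConsoleColor.Gray;

    // Maps a classification type string to a color; unknown types fall back
    // to the default style. The color choices here are arbitrary.
    public static ConsoleColor StyleFromClassificationType(string classificationType)
    {
        switch (classificationType)
        {
            case ClassificationTypeNames.Keyword:
                return ConsoleColor.Blue;
            case ClassificationTypeNames.Comment:
                return ConsoleColor.DarkGreen;
            case ClassificationTypeNames.StringLiteral:
                return ConsoleColor.DarkRed;
            case ClassificationTypeNames.NumericLiteral:
                return ConsoleColor.Magenta;
            case ClassificationTypeNames.ClassName:
            case ClassificationTypeNames.InterfaceName:
            case ClassificationTypeNames.StructName:
            case ClassificationTypeNames.EnumName:
                return ConsoleColor.Cyan;
            default:
                return DefaultStyle;
        }
    }
}
```

Because the constants are compile-time strings, a switch works directly on the ClassificationType value; a Dictionary<string, ConsoleColor> would do equally well if the mapping needs to be configurable.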

Since Roslyn lacks any API documentation at this time which might convey the authors' intent for these APIs, I'd advise measuring performance before using this implementation with vim and vigor.

If profiling shows this to be unduly expensive, it would be relatively trivial to cache the representation of the n most recently viewed source lines in this format, recomputing where needed and invalidating the cache whenever the source code changes.
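A minimal sketch of such a cache, assuming a simple least-recently-used policy keyed by line number. LineCache is a hypothetical class, not a Roslyn type, and it throws the whole cache away on any edit rather than trying to track which lines changed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical LRU cache of per-line results, keyed by line number.
public sealed class LineCache<T>
{
    private readonly int capacity;
    private readonly Func<int, T> compute;
    private readonly Dictionary<int, LinkedListNode<KeyValuePair<int, T>>> index =
        new Dictionary<int, LinkedListNode<KeyValuePair<int, T>>>();
    // Most recently used entry first, least recently used last.
    private readonly LinkedList<KeyValuePair<int, T>> recency =
        new LinkedList<KeyValuePair<int, T>>();

    public LineCache(int capacity, Func<int, T> compute)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException(nameof(capacity));
        this.capacity = capacity;
        this.compute = compute ?? throw new ArgumentNullException(nameof(compute));
    }

    public T Get(int lineNumber)
    {
        if (index.TryGetValue(lineNumber, out var node))
        {
            // Cache hit: move the entry to the front of the recency list.
            recency.Remove(node);
            recency.AddFirst(node);
            return node.Value.Value;
        }

        // Cache miss: compute first, so a failure leaves the cache untouched.
        T value = compute(lineNumber);

        if (index.Count >= capacity)
        {
            // Evict the least recently used line.
            var oldest = recency.Last;
            recency.RemoveLast();
            index.Remove(oldest.Value.Key);
        }

        var fresh = new LinkedListNode<KeyValuePair<int, T>>(
            new KeyValuePair<int, T>(lineNumber, value));
        recency.AddFirst(fresh);
        index[lineNumber] = fresh;
        return value;
    }

    // Call whenever the source text changes.
    public void Invalidate()
    {
        index.Clear();
        recency.Clear();
    }
}
```

Since DescribeLine is a lazy iterator, the cached value should be materialized, for example new LineCache<List<ColoredSpan>>(200, n => DescribeLine(n).ToList()), where 200 is an arbitrary capacity.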
