Janko

Reputation: 9335

How to tokenize a code snippet using a Textmate grammar in Node

I'm trying to syntax-highlight code snippets on my library's website. I've tried Highlight.js and Prism, but neither of them tokenizes the code (Ruby) correctly, so the result is not highlighted properly. Both implement their own tokenization regexes, an approach that's bound to have flaws.

I then discovered that GitHub, Atom and VSCode all use TextMate grammars for tokenization. To me this sounds like the right approach: language grammars are maintained in a single place, and other tools reuse them instead of each defining their own.

My question is: how can I tokenize a code string using a TextMate grammar in Node? My goal is to have something like:

const codeSnippet = `
class Foo
  def bar
    puts "baz"
  end
end
`

const tokenized = tokenizeCode(codeSnippet, 'ruby')

tokenized // some kind of array of tokens, e.g:
// [
//   ['keyword', 'class'],
//   ['whitespace', ' '],
//   ['class', 'Foo'],
//   ...
// ]

I've tried vscode-textmate, which is what VSCode seems to use for its own syntax highlighting. However, I couldn't figure out how to use it to achieve the functionality above.

Ultimately I want to end up with HTML that I can syntax-highlight:

<pre>
  <code>
    <span class="token keyword">class</span> <span class="token class">Foo</span>
    <!-- ... -->
  </code>
</pre>

Again, I've tried Highlight.js and Prism, but both tokenize even the simplest Ruby code incorrectly.

Edit

Here are some examples where Prism and Highlight.js incorrectly tokenize Ruby code:

Upvotes: 5

Views: 2248

Answers (2)

Janko

Reputation: 9335

I've found the Highlights package under the Atom organization, which uses TextMate grammars and produces tokenized markup. It also has a synchronous API, which I need for integrating with Remarkable.

const Highlights = require("highlights")

const highlighter = new Highlights()

const html = highlighter.highlightSync({
  fileContents: 'answer = 42',
  scopeName: 'source.ruby',
})

html //=>
// <pre class="editor editor-colors">
//   <div class="line">
//     <span class="source ruby">
//       <span>answer&nbsp;</span>
//       <span class="keyword operator assignment ruby">
//         <span>=</span>
//       </span>
//       <span>&nbsp;</span>
//       <span class="constant numeric ruby">
//         <span>42</span>
//       </span>
//     </span> 
//   </div>
// </pre>

Under the hood it uses First Mate for tokenization, which is an alternative to vscode-textmate but is much easier to use:

const { GrammarRegistry } = require('first-mate')

const registry = new GrammarRegistry()
const grammar = registry.loadGrammarSync('./ruby.cson')

const tokens = grammar.tokenizeLines('answer = 42') // does all the work

tokens[0] //=>
// [ { value: 'answer ', scopes: [ 'source.ruby' ] },
//   { value: '=',
//     scopes: [ 'source.ruby', 'keyword.operator.assignment.ruby' ] },
//   { value: ' ', scopes: [ 'source.ruby' ] },
//   { value: '42',
//     scopes: [ 'source.ruby', 'constant.numeric.ruby' ] } ]
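
If you want your own markup instead of what Highlights generates, tokens of the shape shown above can be turned into HTML by hand. The following is only a rough sketch: `escapeHtml`, `tokenToHtml` and `tokensToHtml` are hypothetical helper names (not part of first-mate), the sample tokens are copied from the output above, and using only the innermost scope for the class names is an assumption you may want to change.

```javascript
// Escape characters that are special in HTML text and attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

// Turn one token into markup. The class list is built from the innermost
// (last) scope, e.g. 'constant.numeric.ruby' -> 'constant numeric ruby'.
// Tokens that only carry the root scope are emitted as plain escaped text.
function tokenToHtml(token) {
  const text = escapeHtml(token.value)
  if (token.scopes.length < 2) return text
  const innermost = token.scopes[token.scopes.length - 1]
  const classes = innermost.split('.').join(' ')
  return `<span class="${escapeHtml(classes)}">${text}</span>`
}

// `lines` is an array of lines, each line being an array of tokens.
function tokensToHtml(lines) {
  const body = lines.map((line) => line.map(tokenToHtml).join('')).join('\n')
  return `<pre><code>${body}</code></pre>`
}

// Sample input copied from the tokenizer output shown above.
const sampleLines = [[
  { value: 'answer ', scopes: ['source.ruby'] },
  { value: '=', scopes: ['source.ruby', 'keyword.operator.assignment.ruby'] },
  { value: ' ', scopes: ['source.ruby'] },
  { value: '42', scopes: ['source.ruby', 'constant.numeric.ruby'] },
]]

const renderedHtml = tokensToHtml(sampleLines)
console.log(renderedHtml)
```

Because the function is synchronous and only works on plain objects, it should fit into the same kind of synchronous highlighting hook mentioned above.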

Upvotes: 1

Christian Ivicevic

Reputation: 10895

After posting my comment I gave it another try and was successful this time around. The following example shows how to use vscode-textmate with the official TypeScript.tmLanguage, but the basics should be applicable to other languages.

  1. First make sure you have Python 2.7 (not 3.X) installed on your machine and, on Windows, available in your PATH variable.
  2. Install vscode-textmate using npm or yarn, which will invoke the required Python interpreter during installation.
  3. Grab your XML grammar (usually ending in .tmLanguage) and place it in the project root.
  4. Use the vscode-textmate package as follows:

import * as fs from "fs";
import { INITIAL, parseRawGrammar, Registry } from "vscode-textmate";

const registry = new Registry({
    // eslint-disable-next-line @typescript-eslint/require-await
    loadGrammar: async (scopeName) => {
        if (scopeName === "source.ts") {
            return new Promise<string>((resolve, reject) =>
                fs.readFile("./grammars/TypeScript.tmLanguage", (error, data) =>
                    error !== null ? reject(error) : resolve(data.toString())
                )
            ).then((data) => parseRawGrammar(data));
        }
        console.info(`Unknown scope: ${scopeName}`);
        return null;
    },
});

registry.loadGrammar("source.ts").then(
    (grammar) => {
        fs.readFileSync("./samples/test.ts")
            .toString()
            .split("\n")
            .reduce((previousRuleStack, line) => {
                console.info(`Tokenizing line: ${line}`);
                const { ruleStack, tokens } = grammar.tokenizeLine(line, previousRuleStack);
                tokens.forEach((token) => {
                    console.info(
                        ` - ${token.startIndex}-${token.endIndex} (${line.substring(
                            token.startIndex,
                            token.endIndex
                        )}) with scopes ${token.scopes.join(", ")}`
                    );
                });
                return ruleStack;
            }, INITIAL);
    },
    (error) => {
        console.error(error);
    }
);

Keep in mind that the source.ts string does not refer to a file; it is the scope name from the grammar file. Most likely it would be source.ruby in your case. Furthermore, the snippet is not optimized and is barely readable, but you should get the gist of how to use the package in the first place.

After extracting the tokens you can then map them accordingly based on your requirements.
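
As one possible mapping, here is a small sketch that produces the `[type, text]` pairs asked for in the question. `toPairs` is a hypothetical helper, not part of vscode-textmate; it assumes each token has the `startIndex`, `endIndex` and `scopes` fields used in the snippet above, and that the last entry in `scopes` is the most specific one. The sample tokens are written by hand for illustration and are not real grammar output.

```javascript
// Hypothetical helper: converts the tokens of one line into [scope, text] pairs.
function toPairs(line, tokens) {
  return tokens.map((token) => [
    token.scopes[token.scopes.length - 1], // assumed: most specific scope is last
    line.substring(token.startIndex, token.endIndex),
  ])
}

// Hand-written sample tokens for illustration only.
const sampleLine = 'answer = 42'
const sampleTokens = [
  { startIndex: 0, endIndex: 7, scopes: ['source.ruby'] },
  { startIndex: 7, endIndex: 8, scopes: ['source.ruby', 'keyword.operator.assignment.ruby'] },
  { startIndex: 8, endIndex: 9, scopes: ['source.ruby'] },
  { startIndex: 9, endIndex: 11, scopes: ['source.ruby', 'constant.numeric.ruby'] },
]

const pairs = toPairs(sampleLine, sampleTokens)
console.log(pairs)
```

In the snippet above, you would call such a helper inside the `reduce` callback with `line` and `tokens`, collecting the pairs instead of logging them.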

The output in my snippet looks as follows:

Output Sample

Upvotes: 2
