How to tokenize a code snippet using a TextMate grammar in Node

Question

I'm trying to syntax-highlight code snippets on my library's website. I've tried Highlight.js and Prism, but neither of them tokenizes the code correctly (it's Ruby), so the code ends up highlighted incorrectly. This is because they both implement their own tokenization regexes, which is an approach that's bound to have flaws.

I then discovered that GitHub, Atom and VSCode all use TextMate grammars for their tokenization. To me this sounds like the right approach: language grammars are maintained in a single place, and other tools can reuse them instead of each defining their own.

My question is: how can I tokenize a code string using a TextMate grammar in Node? My goal is to have something like:

const codeSnippet = `
class Foo
  def bar
    puts "baz"
  end
end
`

const tokenized = tokenizeCode(codeSnippet, 'ruby')

tokenized // some kind of array of tokens, e.g:
// [
//   ['keyword', 'class'],
//   ['whitespace', ' '],
//   ['class', 'Foo'],
//   ...
// ]

I've tried vscode-textmate, which is what VSCode seems to use for its own syntax highlighting. However, I couldn't figure out how to use it to achieve the functionality above.

Ultimately I want to end up with HTML that I can syntax-highlight:

  
    class Foo

Again, I've tried Highlight.js and Prism, but they both incorrectly tokenize even the simplest Ruby code.

Edit

Here are some examples where Prism and Highlight.js incorrectly tokenize Ruby code:

Highlight.js – doesn't tokenize Post as a "constant"

const hljs = require("highlight.js/lib/highlight.js");
hljs.registerLanguage('ruby', require('highlight.js/lib/languages/ruby'));

const rubyCode = `Post.create(params[:post])`
const html = hljs.highlight('ruby', rubyCode).value

console.log(html)
// Post.create(params[:post])

Prism – doesn't tokenize foo: as a "symbol"

const Prism = require('prismjs');
const loadLanguages = require('prismjs/components/');
loadLanguages(['ruby']);

const rubyCode = `{ foo: "bar" }`
const html = Prism.highlight(rubyCode, Prism.languages.ruby, 'ruby')

console.log(html)
// { foo: "bar" }

Janko · Accepted Answer

I've found the Highlights package under the Atom organization, which uses TextMate grammars and produces tokenized markup. It also has a synchronous API, which I need for integrating with Remarkable.

const Highlights = require("highlights")

const highlighter = new Highlights()

const html = highlighter.highlightSync({
  fileContents: 'answer = 42',
  scopeName: 'source.ruby',
})

html //=>
// //   
//     
//       answer 
//       
//         =
//       
//        
//       
//         42
//       
//      
//   
//

Under the hood it uses First Mate for tokenization, which is an alternative to vscode-textmate but is much easier to use:

const { GrammarRegistry } = require('first-mate')

const registry = new GrammarRegistry()
const grammar = registry.loadGrammarSync('./ruby.cson')

const tokens = grammar.tokenizeLines('answer = 42') // does all the work

tokens[0] //=>
// [ { value: 'answer ', scopes: [ 'source.ruby' ] },
//   { value: '=',
//     scopes: [ 'source.ruby', 'keyword.operator.assignment.ruby' ] },
//   { value: ' ', scopes: [ 'source.ruby' ] },
//   { value: '42',
//     scopes: [ 'source.ruby', 'constant.numeric.ruby' ] } ]
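
If you want something closer to the array of pairs from the question, or your own HTML instead of what Highlights generates, the token objects above are easy to post-process. Here is a rough sketch, not part of first-mate itself: `toPairs`, `toHtml` and `escapeHtml` are hypothetical helper names, and using the last scope as the token's "type" and turning its dots into CSS class names are my own assumptions. The sample input is copied from the `tokens[0]` output above, so the snippet runs without first-mate installed.

```javascript
// Escape characters that are special in HTML text and attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

// Hypothetical helper: flatten tokenized lines into [scope, value] pairs,
// treating the most specific (last) scope of each token as its type.
function toPairs(lines) {
  return lines.flatMap((line) =>
    line.map((token) => [token.scopes[token.scopes.length - 1], token.value])
  )
}

// Hypothetical helper: render tokenized lines as HTML. Tokens that carry
// only the root scope are emitted as plain text; the rest are wrapped in a
// <span> whose classes come from the last scope (assumed convention).
function toHtml(lines) {
  const renderToken = ({ value, scopes }) => {
    const text = escapeHtml(value)
    if (scopes.length < 2) return text
    const classes = scopes[scopes.length - 1].split('.').join(' ')
    return `<span class="${escapeHtml(classes)}">${text}</span>`
  }
  const body = lines.map((line) => line.map(renderToken).join('')).join('\n')
  return `<pre><code>${body}</code></pre>`
}

// Sample input in the shape shown above: an array of lines,
// each line being an array of { value, scopes } tokens.
const lines = [[
  { value: 'answer ', scopes: ['source.ruby'] },
  { value: '=', scopes: ['source.ruby', 'keyword.operator.assignment.ruby'] },
  { value: ' ', scopes: ['source.ruby'] },
  { value: '42', scopes: ['source.ruby', 'constant.numeric.ruby'] },
]]

console.log(toPairs(lines))
console.log(toHtml(lines))
```

With real input you would pass the result of `grammar.tokenizeLines(code)` instead of the hard-coded `lines`. Picking only the last scope loses the nesting information, so if your theme needs parent scopes you would have to emit nested spans instead.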
