Reputation: 11

Extract source comments from a Scala source file

I would like to programmatically extract the code comments from a Scala source file.

I have access to both the source file and objects of the classes whose comments I am interested in. I am also open to writing the comments in the Scala source file in a specific form to facilitate extraction (though still following Scaladoc conventions).

Specifically, I am not looking for HTML or similar output.

However, a json object I can then traverse to get the comments for each field would be perfectly fine (though json is not a requirement). Ideally, I would like to get a class or class member comment given its "fully qualified" name or an object of the class.

How do I best do this? I am hoping for a solution that is maintainable (without too much effort) from Scala 2.11 to Scala 3.

Appreciate all help!

Upvotes: 0

Answers (1)

Algamest

Reputation: 1529

I have access to both the source file

By this I assume you have the path to the file, which I'll represent in my code as:

val pathToFile: String = ???

TL;DR

import scala.io.Source

def comments(pathToFile: String): List[String] = {
  // A def, so every use re-reads the file and gets a fresh iterator
  def lines: Iterator[(String, Int)] = Source.fromFile(pathToFile).getLines().zipWithIndex

  // Block comments that open and close on the same line
  val singleLineJavaDocStartAndEnds = lines.filter {
    case (line, _) => line.contains("/*") && line.contains("*/")
  }.map { case (line, _) => line }

  // Lines on which a multi-line block comment opens or closes, paired up in order
  val javaDocComments = lines.filter {
    case (line, _) =>
      (line.contains("/*") && !line.contains("*/")) ||
      (!line.contains("/*") && line.contains("*/"))
  }
  .grouped(2).collect { // collect skips an unpaired trailing marker instead of throwing a MatchError
    case Seq((_, firstLineNumber), (_, secondLineNumber)) =>
      lines
        .map { case (line, _) => line }
        .slice(firstLineNumber, secondLineNumber + 1) // slice excludes its end index
        .mkString("\n")
  }

  // Any line containing "//", including trailing comments after code
  val slashSlashComments = lines
    .filter { case (line, _) => line.contains("//") }
    .map { case (line, _) => line }

  (singleLineJavaDocStartAndEnds ++ javaDocComments ++ slashSlashComments).toList
}

Full explanation

First thing to do is to read the contents of the file:

import scala.io.Source

def lines: Iterator[String] = Source.fromFile(pathToFile).getLines()

// getLines() strips the line terminators, so we re-join with "\n"; on Windows you may want "\r\n" instead
val content: String = lines.mkString("\n")
// where `content` is the whole file as a `String`

I have made lines a def to prevent unintended results when lines is used more than once. Source.fromFile(...).getLines() returns an Iterator, which can be traversed only once, so a val would be empty on the second use; a def re-reads the file on each call. Since you are reading source code files I think rereading the file is a safe operation to perform and won't lead to memory or performance issues.
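To make the val versus def point concrete, here is a small sketch using an in-memory list as a stand-in for the file. The object name IteratorOnce and the helper readAll are illustrative and not part of the code in this answer; readAll shows one possible alternative, assuming you would rather read the file once into a Vector and close the handle.

```scala
import scala.io.Source

object IteratorOnce {
  // In-memory stand-in for the lines of a source file.
  val asVal: Iterator[String] = List("a", "// b", "c").iterator
  def asDef: Iterator[String] = List("a", "// b", "c").iterator

  // The first traversal consumes the single iterator held by the val.
  val firstPass: List[String] = asVal.toList
  // Nothing is left for a second traversal.
  val exhausted: Boolean = !asVal.hasNext

  // Each call to the def builds a fresh iterator, so both passes see all lines.
  val defFirstPass: List[String]  = asDef.toList
  val defSecondPass: List[String] = asDef.toList

  // Hypothetical alternative: read once, keep the lines, always close the file.
  def readAll(pathToFile: String): Vector[String] = {
    val source = Source.fromFile(pathToFile)
    try source.getLines().toVector finally source.close()
  }
}
```

With the lines held in a Vector, later steps can filter and slice the same collection repeatedly without touching the file again.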

Now that we have the content of the file we can begin to filter out the lines we don't care about. Another way of viewing the problem is that we only want to keep - filter in - the lines that are comments.

Edit:

As @jwvh rightly pointed out, my earlier use of .trim.startsWith missed comments such as:

val x = 1 //mid-code-comments

/*fullLineComment*/

To address this I've replaced .trim.startsWith with .contains.

For single line comments this is simple:

val slashComments: Iterator[String] = lines.filter(line => line.contains("//"))

An earlier version of this answer called .trim before .startsWith, because developers often indent comments to match the surrounding code and trim removes the leading (and trailing) whitespace. The code now uses .contains, which catches a comment starting anywhere on the line.
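The difference between the two checks can be seen on a few sample lines. This is only a sketch; the object name and the sample lines are made up for illustration. Note the third line: .contains also matches a "//" inside a string literal such as a URL, which is a limitation of this line-based approach.

```scala
object SlashCommentCheck {
  val sample: List[String] = List(
    "  // indented full-line comment",
    "val x = 1 //mid-code-comments",
    "val url = \"http://example.com\"", // not a comment, but the text contains "//"
    "val y = 2"
  )

  // Matches only lines whose first non-whitespace characters are "//".
  val withStartsWith: List[String] = sample.filter(line => line.trim.startsWith("//"))

  // Matches any line with "//" anywhere, including the URL false positive.
  val withContains: List[String] = sample.filter(line => line.contains("//"))
}
```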

Now we'll find multi-line comments, or JavaDoc; for example (the content is not important):

/**
 * Class String is special cased within the Serialization Stream Protocol.
 *
 * A String instance is written into an ObjectOutputStream according to
 * .....
 * .....
 */

The safest thing to do is to find the lines that the /* and */ appear on and include all of the lines in between:

def lines: Iterator[(String, Int)] = Source.fromFile(pathToFile).getLines().zipWithIndex

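val javaDocStartAndEnds: Iterator[(String, Int)] = lines.filter {
  // keep a line only if it opens or closes a block comment, but not both;
  // a line holding both markers is a complete comment and would break the pairing below
  // (the TL;DR version collects those lines separately)
  case (line, _) => line.contains("/*") != line.contains("*/")
}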

.zipWithIndex pairs each line with an incrementing index, starting at 0, which we can use as the line number within the source file. The filter above leaves only the lines on which a block comment opens or closes. Assuming every /* has a matching */, these lines come in pairs, so we group them in twos. For each pair we then use slice to select every line from the first index to the last. slice excludes its end index, so we add 1 to include the closing line.

val javaDocComments = javaDocStartAndEnds.grouped(2).collect { // collect skips an unpaired trailing marker
  case Seq((_, firstLineNumber), (_, secondLineNumber)) =>
    lines // re-calling `def lines: Iterator[(String, Int)]`
      .map { case (line, _) => line } // here we only care about the `line`, not the `lineNumber`
      .slice(firstLineNumber, secondLineNumber + 1) // `slice` excludes its end index, hence the + 1
      .mkString("\n")
}
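As a worked illustration of the pairing and slicing, here is the same idea applied to a handful of in-memory lines rather than a file. The object name and the sample source are invented for the example, and it assumes every /* is closed by a later */.

```scala
object BlockCommentPairs {
  val source: Vector[String] = Vector(
    "package demo",     // index 0
    "/**",              // index 1
    " * Docs for Foo.", // index 2
    " */",              // index 3
    "class Foo",        // index 4
    "/* start",         // index 5
    "   end */"         // index 6
  )

  // Indices of lines that open or close a block comment, but not both.
  val boundaries: List[Int] = source.zipWithIndex.collect {
    case (line, index) if line.contains("/*") != line.contains("*/") => index
  }.toList

  // Pair the boundaries and slice out each comment.
  // collect silently skips a leftover unpaired boundary.
  val blocks: List[String] = boundaries.grouped(2).collect {
    case List(first, last) => source.slice(first, last + 1).mkString("\n")
  }.toList
}
```

Here boundaries is List(1, 3, 5, 6), so the two groups (1, 3) and (5, 6) each yield one complete comment.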

Finally we can combine slashComments and javaDocComments:

val comments: List[String] = (slashComments ++ javaDocComments).toList

Whichever order we join them in, the combined list will not be in source order, because each kind of comment is collected in a separate pass. An improvement would be to keep the lineNumber alongside each comment and sort by it at the end.
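A minimal sketch of that improvement, assuming each comment has been kept together with the zero-based line number on which it starts. The object name and the sample data are illustrative only.

```scala
object OrderedComments {
  // Each comment is tagged with the line on which it starts.
  val slashComments: List[(Int, String)] =
    List((4, "val x = 1 // trailing"), (0, "// header"))
  val blockComments: List[(Int, String)] =
    List((1, "/**\n * Docs.\n */"))

  // Join both kinds, sort by starting line, then drop the line numbers.
  val inSourceOrder: List[String] =
    (slashComments ++ blockComments)
      .sortBy { case (lineNumber, _) => lineNumber }
      .map { case (_, text) => text }
}
```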

I have included a "too long; didn't read" (TL;DR) version at the top so anyone can copy the code in full without the step-by-step explanation.

How do I best do this? I am hoping for a solution that is maintainable (without too much effort) from Scala 2.11 to Scala 3.

I hope I have answered your question and provided a useful solution. You mentioned a JSON file as output. What I've provided is a List[String] in memory which you can process. If output to JSON is required I can update my answer with this.

Upvotes: 1
