smallB

Reputation: 17138

Counting lines of code

I was doing some research on line counters for C++ projects, and I'm very interested in the algorithms they use. Does anyone know where I can look at an implementation of such algorithms?

Upvotes: 7

Views: 22672

Answers (4)

stelonix

Reputation: 756

There's cloc, a free, open-source counter of source lines of code. It supports many languages, including C++. I personally use it to get the line counts of my projects.

On its SourceForge page you can download the Perl source code.

Upvotes: 27

Daniel

Reputation: 6775

I think part of the reason people are having so much trouble understanding your problem is that "count the lines of C++" is itself an algorithm. Perhaps what you're trying to ask is "How do I identify a line of C++ in a file?" That is an entirely different question, which Kos seems to have done a pretty good job of explaining.

Upvotes: 2

Kos

Reputation: 72299

You don't need to actually parse the code to count its lines; it's enough to tokenise it.

The algorithm could look like:

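int lastLine = -1;  // line of the most recently counted code token
int lines = 0;      // number of lines of code found so far
for each token {
    // count each line once, when the first code token on it is seen
    if (isCode(token) && lastLine != token.line) {
        ++lines;
        lastLine = token.line;
    }
}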

The only information you need to collect during tokenisation is:

  • what type of token it is (an operator, an identifier, a comment...). You don't need to be very precise here, as you only need to distinguish "non-code tokens" (comments) from "code tokens" (anything else)
  • on which line of the file the token occurs.
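To make the loop above concrete, here is a minimal C++ sketch. The `Token` struct, the `TokenKind` enum and the function names are hypothetical, chosen only for illustration, and the tokens are assumed to arrive in file order:

```cpp
#include <vector>

// Hypothetical token representation: only the two fields the algorithm needs.
enum class TokenKind { Code, Comment };

struct Token {
    TokenKind kind;  // comment vs. anything else
    int line;        // 1-based line on which the token starts
};

// A token counts as code if it is anything other than a comment.
bool isCode(const Token& token) {
    return token.kind == TokenKind::Code;
}

// Count the distinct lines on which at least one code token starts.
// Assumes the tokens are ordered by their position in the file.
int countCodeLines(const std::vector<Token>& tokens) {
    int lastLine = -1;
    int lines = 0;
    for (const Token& token : tokens) {
        if (isCode(token) && token.line != lastLine) {
            ++lines;
            lastLine = token.line;
        }
    }
    return lines;
}
```

Because the tokens are visited in order, remembering only the last counted line is enough; no set of line numbers is needed.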

As for how to tokenise, that's for you to figure out, but hand-writing a tokeniser for such a simple case shouldn't be hard. You could use flex, but that's probably overkill.


EDIT

I've mentioned "tokenisation", let me describe it for you quickly:

Tokenisation is the first stage of compilation. The input of tokenisation is text (multi-line program), and the output is a sequence of "tokens", as in: symbols with some meaning. For instance, the following program:

#include "something.h"

/*
This is my program.
It is quite useless.
*/
int main() {
    return something(2+3); // this is equal to 5
}

could be turned into the following tokens:

PreprocessorDirective("include")
StringLiteral("something.h")
PreprocessorDirectiveEnd
MultiLineComment(...)
Keyword(INT)
Identifier("main")
Symbol(LeftParen)
Symbol(RightParen)
Symbol(LeftBrace)
Keyword(RETURN)
Identifier("something")
Symbol(LeftParen)
NumericLiteral(2)
Operator(PLUS)
NumericLiteral(3)
Symbol(RightParen)
Symbol(Semicolon)
SingleLineComment(" this is equal to 5")
Symbol(RightBrace)

Et cetera.

Tokens, depending on their type, may have arbitrary metadata attached to them (e.g. the symbol type, the operator type, the identifier text, or perhaps the number of the line where the token was found).

Such a stream of tokens is then fed to the parser, which uses grammar production rules written in terms of these tokens to build, for instance, a syntax tree.

Writing a full parser that would give you a complete syntax tree of the code is challenging, especially if it's C++ we're talking about. However, tokenising (or "lexing" or "lexical analysis") is easier, especially when you don't care about many of the details, and you should be able to write a tokeniser yourself using a finite state machine.

As for how to actually use the output to count lines of code (i.e. lines on which at least one "code" token, meaning any token except a comment, starts), see the algorithm I've described earlier.

Upvotes: 3

James Kanze

Reputation: 154027

Well, if by line counters you mean programs which count lines, then the algorithm is pretty trivial: just count the number of '\n' characters in the code. If, on the other hand, you mean programs which count C++ statements, or produce other metrics... Although it's not 100% accurate, I've gotten pretty good results in the past just by counting '}' and ';' (ignoring those in comments and in string and character literals, of course). Anything more accurate would probably require parsing the actual C++.

Upvotes: 4
