alain.janinm

Reputation: 20065

Determine Cobol coding style

I'm developing an application that parses Cobol programs. Some of these programs follow the traditional coding style (program text from column 8 to 72), and some are newer and don't follow this style.

In my application I need to determine the coding style in order to know if I should parse content after column 72.

I've been able to determine whether a program starts at column 1 or 8, but programs that start at column 1 can also follow the rule of comments after column 72.

So I'm trying to find rules that will allow me to determine whether text after column 72 is a comment or valid code.

I've found some, but it's hard to tell if they will work every time:

I'd like to know what you think of these rules, and whether you have any ideas to help me determine the coding style of a Cobol program.

I don't need an API or anything like that, just solid rules that I will be able to rely on.

Upvotes: 2

Views: 962

Answers (4)

Brian Tiffin

Reputation: 4126

Most COBOL compilers will let you generate and inspect the output of the text manipulation (preprocessing) phase.

The text preprocessor output can be seen with the following command (using OpenCOBOL for the example):

cobc -E program.cob

The text manipulation processor deals with any COPY ... REPLACING compiler directives, as well as converting SOURCE FORMAT IS FIXED (with line continuations, string literal concatenations, comment line removal, among other things) to the actual free format that the compiler lexical analyzer needs. A lot of the OpenCOBOL toolkits (Cross referencer and Animator, to name two) use source code AFTER the preprocessor pass. I don't think you'll lose any street cred if your parser program relies on post processed source code files.
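If the parser is driven from a script, the preprocessing step could be wrapped roughly as follows. This is only a sketch: it assumes `cobc` is on the PATH and that `-E` writes the preprocessed text to standard output; the function name `preprocess` and its `compiler` parameter are made up for illustration.

```python
import shutil
import subprocess
from typing import Optional


def preprocess(path: str, compiler: str = "cobc") -> Optional[str]:
    """Return the preprocessed source, or None if the compiler is missing or fails.

    Assumption: `<compiler> -E <path>` prints the preprocessed text to stdout.
    """
    exe = shutil.which(compiler)
    if exe is None:
        return None  # compiler not installed on this machine
    result = subprocess.run([exe, "-E", path], capture_output=True, text=True)
    if result.returncode != 0:
        return None  # preprocessing failed; the caller decides what to do next
    return result.stdout
```

The parser would then work on the returned text instead of the raw source file.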

Upvotes: 0

Justin Morgan

Reputation: 30700

There won't be an algorithm to do this with 100% certainty, because if comments can be anything, they can also be compilable COBOL code. So you could theoretically write a program that means one thing if the comments are ignored, and something else entirely if the comments are treated as part of the COBOL.

But that's extremely unlikely. What's most likely to happen is that if you try to compile the code under the wrong convention, it will simply fail. So the only accurate way to do this is to try compiling/parsing the program one way, and if you come to a line that can't make sense, switch to the other style. You could also support passing an argument to the compiler when the style is already known.

You can try using heuristics like what you've described, but that will never be totally accurate. The most they can give you is a probability that the code is one or the other style, which will increase as they examine more and more lines of code. They could be useful for helping you guess the style before you start compiling, or for figuring out when the problem is really just a typo in the code.

EDIT:

Regarding ideas for heuristics, it's hard to say. If there were a standard comment sigil like // or # in other languages, this would be a lot easier (actually, there is, but it sounds like your code doesn't follow this convention). The only thing I can think of would be to check whether every line (or maybe 99% of lines, and not counting empty lines or lines commented with *) has a period somewhere before position 72.
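A rough sketch of that period check, in Python. The function name `period_ratio` is hypothetical, and the comment-line test (a `*` in column 7 or as the first non-blank character) is an assumption about how the sources mark comments.

```python
def period_ratio(source: str, limit: int = 72) -> float:
    """Fraction of code lines that contain a period within the first `limit` columns."""
    considered = 0
    with_period = 0
    for line in source.splitlines():
        head = line[:limit]
        if not head.strip():
            continue  # empty line (or nothing before the limit): not counted
        # Assumed comment markers: '*' in column 7, or '*' as first non-blank character.
        if (len(line) > 6 and line[6] == "*") or head.lstrip().startswith("*"):
            continue
        considered += 1
        if "." in head:
            with_period += 1
    return with_period / considered if considered else 0.0
```

A ratio close to 1 would suggest that the text before position 72 stands on its own; the threshold is a guess that needs tuning against real programs.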

One thing you DON'T want to do is apply any heuristics to the part after position 72. That is, you don't want to be checking the comments to see if they're valid COBOL. You want to check what you know is COBOL first, and see if that works by itself. There are several reasons for this:

  • Comments written in English are likely to have periods and quotes in them, so your first and second bullet points are out.
  • Natural languages are WAY harder to parse than something like COBOL.
  • The comments could easily have COBOL in them (maybe someone commented out the previous version of the line).
  • An important rule for comments is that they should never affect what the program does. If changing the comments can change how the program is compiled, you violate that.

All that in mind, my opinion is that you shouldn't use heuristics at all. You should always try to compile the program under both conventions unless one is explicitly specified. There's a chance that code will compile successfully under both conventions, and then you'll have two different programs and no way to tell which one is correct.

If that happens, you need to compare the two results (perhaps with a hash or something) to see if they're the same program. If they're the same, great, but if not, you'll need to force the user to explicitly choose a convention.
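Here is a minimal sketch of that "try both, then compare" idea. The `parse` callable is a stand-in for your own parser: it is assumed to raise `ValueError` on input it cannot handle and to return something with a stable `repr`. The convention labels are made up for illustration.

```python
import hashlib
from typing import Callable


def truncate_at_72(source: str) -> str:
    """Drop everything after column 72 on every line."""
    return "\n".join(line[:72] for line in source.splitlines())


def digest(result: object) -> str:
    """Hash a parse result via its repr (assumes repr is stable and complete)."""
    return hashlib.sha256(repr(result).encode("utf-8")).hexdigest()


def choose_convention(source: str, parse: Callable[[str], object]) -> str:
    candidates = {
        "comments-after-72": truncate_at_72(source),
        "code-after-72": source,
    }
    digests = {}
    for name, text in candidates.items():
        try:
            digests[name] = digest(parse(text))
        except ValueError:
            pass  # this convention does not parse
    if not digests:
        return "neither"
    if len(digests) == 1:
        return next(iter(digests))  # only one convention works
    if len(set(digests.values())) == 1:
        return "same-either-way"  # both work and yield the same program
    return "ask-user"  # both work but differ: force an explicit choice
```

Only the `"ask-user"` outcome needs a human decision; every other outcome can be handled automatically.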

Upvotes: 1

NealB

Reputation: 16928

There is no absolutely reliable way to determine if a COBOL program is in fixed or free format based only on the source code. Heck, it is sometimes difficult to identify the programming language based only on source code. Check out this classic polyglot - it is valid under 8 different language compilers. That said, you could try a few heuristics that might yield the correct answer more often than not.

Compiler directives embedded in source code

Watch for certain compiler directives that determine code format. Unfortunately, every compiler vendor uses their own flavour of directive.

For example, Micro Focus COBOL uses the SOURCEFORMAT directive. This directive will appear near the top of the program, so a short pre-scan could be used to find it. On the other hand, OpenCOBOL uses >>SOURCE FORMAT IS FREE and >>SOURCE FORMAT IS FIXED to toggle between free and fixed format, so different parts of the same program could be formatted differently!

The bottom line here is that you will have to support the conventions of multiple COBOL compilers.
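A pre-scan for the two directive flavours mentioned above could look something like this sketch. The regular expressions are assumptions: the Micro Focus pattern expects the directive on a `$SET` line (for example `$SET SOURCEFORMAT"FREE"`), and other vendors would need their own patterns. The function name is hypothetical.

```python
import re
from typing import List, Optional, Tuple

# >>SOURCE FORMAT IS FREE / >>SOURCE FORMAT IS FIXED (the "IS" is treated as optional here)
_OPENCOBOL = re.compile(r">>\s*SOURCE\s+FORMAT\s+(?:IS\s+)?(FREE|FIXED)\b", re.IGNORECASE)
# Assumed Micro Focus form: $SET ... SOURCEFORMAT"FREE" (quotes, parentheses or = accepted)
_MICROFOCUS = re.compile(r"\$SET\s+.*?SOURCEFORMAT\W*(FREE|FIXED|VARIABLE)\b", re.IGNORECASE)


def find_format_directives(source: str, max_lines: Optional[int] = None) -> List[Tuple[int, str]]:
    """Return (line number, format) pairs for every format directive found."""
    found = []
    for number, line in enumerate(source.splitlines(), start=1):
        if max_lines is not None and number > max_lines:
            break  # short pre-scan: stop after the first max_lines lines
        for pattern in (_OPENCOBOL, _MICROFOCUS):
            match = pattern.search(line)
            if match:
                found.append((number, match.group(1).upper()))
                break
    return found
```

Because the result keeps line numbers, a program that switches format part-way through shows up as several entries rather than one answer.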

Compiler switches

Source code format can also be specified using a compiler switch. In this case, there are no concrete clues to go on. However, you can be reasonably sure that the entire source program will be either fixed or free. All you can do here is guess. Unless the programmer is out to "mess with your head" (and some will), a program in free format will have the keywords IDENTIFICATION DIVISION or ID DIVISION, starting before column 8. Every COBOL program will begin with these keywords, so you can use them as the anchor point for determining code format in the absence of embedded compiler directives.

Warning - this is far from foolproof, but might be a good start.
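As a sketch, the anchor-point guess could be coded like this. The function name is made up, tabs are not expanded, and a comment line that happens to contain the keywords would fool it.

```python
import re
from typing import Optional

_DIVISION = re.compile(r"\b(?:IDENTIFICATION|ID)\s+DIVISION\b", re.IGNORECASE)


def guess_format_from_header(source: str) -> Optional[str]:
    """Guess 'free' or 'fixed' from the column where the division header starts."""
    for line in source.splitlines():
        match = _DIVISION.search(line)
        if match:
            column = match.start() + 1  # convert 0-based index to 1-based column
            return "free" if column < 8 else "fixed"
    return None  # no header found: no guess
```

A `None` result means the heuristic has nothing to say, which is different from guessing fixed format.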

Upvotes: 1

Ira Baxter

Reputation: 95420

I think you need to know the COBOL compiler for each program. Its documentation should tell you what conventions/configurations/switches it uses to decide if the source code ends at column 72 or not.

So.... which compiler(s)?

And if you think the column 72 issue is a pain, wait till you get around to actually parsing the COBOL itself. If you are not well prepared to handle the lexical issues of the language, you are probably very badly prepared to handle the syntactic ones.

Upvotes: 2
