Reputation: 6042
I am curious if there are any extensive overview, preferrably specifications / technical reports about the GNU style and other commonly used styles for parsing Command Line Arguments.
As far as I know, there are many catches and it's not completely trivial to write a parsing library that would be as compliant as, for example, C++ boost::program_options, Python's argparse, GNU getopt and more.
On the other hand, there might be libraries that are too liberal in accepting certain options or too restrictive. So, if one wants to aim for a good compatibility / conformance with a de-facto standard (if such exists), is there a better way than simply reading a number of mature libraries' source code and/or test cases?
Upvotes: 0
Views: 436
Reputation: 241771
Posix provides guidelines for the syntax of utilities, as Chapter 12 of XBD (the Base Definitions). It's certainly worth a read. As is noted, backwards-compatibility has meant that many standardized utilities do not conform to these guidelines, but nonetheless the standard recommends
... that all future utilities and applications use these guidelines to enhance user portability. The fact that some historical utilities could not be changed (to avoid breaking existing applications) should not deter this future goal.
You can also read the rationale for the syntax guidelines.
Posix provides a basic syntax but it's insufficient for utilities with a large number of arguments, and single-letter options are somewhat lacking in self-documentation. Some utilities -- test
, find
and tcpdump
spring to mind -- essentially implement domain specific languages. Others -- ls
and ps
, for example -- have a bewildering pantheon of invocation options. To say nothing of compilers...
Over the years, a number of possible extension methods have been considered, and probably all of the are still in use in at least one common (possibly even standard) utility. Posix recommends the use of -W
as an extension mechanism, but there are few uses of that. X Windows and TCL/Tk popularized the use of spelled-out multicharacter options, but those utilities expect long option names to still start with a single dash, which renders it impossible to condense non-argument options [Note 1]. Other utilities -- dd
, make
and awk
, to name a few -- special-case arguments which have the form {íd}={val}
with no hyphens at all. The GNU approach of using a double-hyphen seems to have largely won, partly for this reason, but GNU-style option reordering is not universally appreciated.
A brief discussion of GNU style is found in the GNU style guide (see also the list of long options), and a slightly less brief discussion is in Eric Raymond's The Art of Unix Programming [Note 2].
Google code takes command-line options to a new level; the internal library has now been open-sourced as gflags so I suppose it is now not breaking confidentiality to observe how much of Google's server management tooling is done through command-line options. Google flags are scattered indiscriminately throughout the code, so that library functions can define their own options without the calling program ever being aware of them, making it possible to tailor the behaviour of key libraries independently of the application. (It's also possible to modify the value of a gflag on the fly at runtime, another interesting tool for service management.) From a syntactic viewpoint, gflags
allows both single- and double-hyphen long option presentation, indiscriminately, and it doesn't allow coalesced single-character-option calls. [Note 3]
It's worth highlighting the observation in The Unix Programming Environment (Kernighan & Pike) that because the shell "must satisfy both the interactive and programming aspects of command execution, it is a strange language, shaped as much by history as by design." The requirements of these two aspects -- the desire of a concise interactive language and a precise programming language -- are not always compatible.
Syntax flexibility, while handy for the interactive user, can be disastrous for the script author. As an example, last night I typed -env=...
instead of --env=...
which resulted in my passing nv=...
to the -e
option rather than passing ...
to the --env
option, which I didn't notice until someone asked me why I was passing that odd string as an EOF indicator. On the other hand, my pet bugbear -- the fact that some prefer --long-option
and others prefer --long_option
and sometimes you find both styles in the same program (I'm looking at you, gcc) -- is equally annoying as an interactive user and as a scripter.
Sadly, I don't know of any resource which would serve as an answer to this question, and I'm not sure that the above serves the need either. But perhaps we can improve it over time.
Notes:
Obviously a bad idea, since it would make impossible the pastime of constructing useful netstat
invocations whose argument is a readable word.
The book and its author are commonly known as TAOUP and ESR, respectively.
It took me a while to get used to this, and very little time to revert to my old habits. So you can see where my biases lie.
Upvotes: 2