Jiminion

Reputation: 5168

Is there a properly defined way to separate command arguments in C?

For example, consider the call:

>routine -h -s name -t "name also" -u 'name as well'

Would this return 8 arguments or more? Is there a defined standard as to how they are parsed? Where would this be located?

NOTE: I am not interested in code to do this, but the rules or standards that apply. I do not consider reading source code somewhere as documentation of the standard, which I assume must reside somewhere.

Upvotes: 0

Views: 456

Answers (3)

rici

Reputation: 241701

A standard command shell is a programming language whose actions are (mostly) invoking "utilities", which are executable programs. The job the shell performs is to set up the standard environment in order to invoke the utility, which includes:

  • Figuring out which executable corresponds to the utility to be invoked;

  • Assigning stdin, stdout and stderr file descriptors for the utility to appropriate streams;

  • Creating the argv argument vector and passing it to the invoked utility;

  • Setting up the utility's environ global, which the utility can access through the getenv standard library function.

Like any programming language, the shell has values, literals, variables, and control flow. It has a syntax (and a very idiosyncratic lexical analysis algorithm). It also has other primitives which are particularly designed for its task.

As an example, /usr/bin and "this is not a sentence" are literal values in the shell language. The quotes around the second of these are not part of the value; they are part of the language's syntax for literal strings. (The shell language allows many literal strings to be written without quotes, and also includes a complicated expression language so that not all double-quoted strings are literals, but in the simple case a quoted string is not conceptually different from a quoted string in C.)

The basic syntax and semantics are standardized by Posix. Many commonly-used shell languages mostly conform to this standard. Almost all provide extensions; some (if not most) are not completely compatible with even the base standard unless specific options are enabled (bash, for example, must be invoked with the --posix command-line argument). However, the basic principles are generally obeyed, and reading the Shell Command Language chapter of the Posix standard will provide a good overview. It includes a complete grammar.

In general, the procedure is the following:

  • The shell breaks the command line into "words".
  • Some words are "expanded", possibly being replaced by zero or more words.
  • Some words are interpreted as file-descriptor redirections; others as environment variable assignments.
  • If the result is a specific shell syntax, it is executed. Otherwise, the first word is interpreted as the name of a shell function, a builtin command, or an external utility.
  • If the command resolves to an external utility, the words from the command line (other than the ones already used as redirections and assignments) are placed into an argv vector, and the utility is invoked.

It's a lot more complicated than that, but that's the basic model.

Invoking the utility is performed using one of the exec* family of standard library functions, which (in its most general form, execve) takes as arguments:

  • The path to an executable
  • A NULL-terminated vector of pointers to strings, which will be the argv vector
  • A NULL-terminated vector of pointers to strings of the form name=value, which will be the environ global.

The exec call then invokes the external utility. It copies the argument vector and environ list into the utility's address space, but does not otherwise modify or validate the values other than checking that the total size of the two lists does not exceed some system limit.


The rest of this answer pertains to how the utility itself (might or should) parse the argument vector it receives.

There is no standard for interpreting command-line arguments, but there are guidelines and there are standard (and not-so-standard) library routines which impose a kind of de facto standard, which defines what users (might) expect.

To start with, the Posix guidelines are (mostly) implemented by the Posix standard getopt function. These guidelines suggest that optional arguments (those with - flags) precede all positional arguments.

However, not all Posix utilities conform to these suggestions, and it is common to find utilities which "permute" arguments, allowing options to follow positional arguments. This mechanism is (mostly) implemented by the Gnu version of getopt. In addition, Gnu defines (and suggests the use of) the getopt_long function, which allows multicharacter options initiated with --.

In all cases, how optional flag arguments are parsed depends on whether the option is defined as taking an argument or not. So

-s1 word

could be parsed as:

  • If -s takes an argument:
    • option -s with argument "1"
    • positional argument "word"
  • If -s does not take an argument and -1 is a valid flag not taking an argument
    • option -s
    • option -1
    • positional argument "word"
  • If -s does not take an argument and -1 does take an argument:
    • option -s
    • option -1 with argument "word"

In addition to the above, there are also commands which accept "long options" started with a single dash (and thus do not allow short options to be condensed into a single word). This is the style used by Tcl, and is followed by many GUI commands. This style can be parsed with the Gnu function getopt_long_only, which is documented alongside getopt_long.

Upvotes: 1

Keith Thompson

Reputation: 263247

POSIX defines the getopt() function and the getopts command (typically built into the shell) to parse command-line arguments.

The standard only allows for single-letter option names, so it would not support your example:

routine -h -s1 name -s2 "name also" -s3 "name as well"

NOTE: In your question, you have "name as well' at the end of your command line. This would be rejected by the shell before your routine even sees its arguments, because of the mismatched quotation marks. I'll assume that was just a typo.

It's common for commands to support extended option syntax. GNU tools, for example, commonly support long names for options, introduced by -- rather than -, in addition to the standard single-letter options. The GNU version of the getopt function is documented in the GNU C Library manual.

Upvotes: 0

Ignacio Vazquez-Abrams

Reputation: 798616

The shell in use is responsible for parsing the command line and invoking exec*() appropriately. See the documentation for the specific shell in question to learn about its rules, and see its source code to see how it parses the command line.

Upvotes: 1
