CBlew

Reputation: 721

In what order does the Bash parser process escape characters and split words/tokens within a command line?

I am trying to definitively understand the Bash parser's order of operations.


This wiki page claims the following order:

  1. Read line.
  2. Process/remove quotes.
  3. Split on semicolons.
  4. Process 'special operators', which according to the article are:
    • Command groupings and brace expansions, e.g. {…}.
    • Process substitutions, e.g. cmd1 <(cmd2).
    • Redirections.
    • Pipelines.
  5. Perform expansions, which are not all listed, but should include:
    • Brace expansion, e.g. {1..3}. For some reason the article tucks this into the previous stage.
    • Tilde expansion, e.g. ~root.
    • Parameter & variable expansion, e.g. ${var##*/}.
    • Arithmetic expansion, e.g. $((1+12)).
    • Command substitution, e.g. $(date).
    • Word splitting, which applies to the results of the expansions; uses $IFS.
    • Pathname expansion, or globbing, e.g. ls ?d*.
  6. Word splitting, which applies to the whole line; does not use $IFS.
  7. Execution.

This is not a quote, but a paraphrase of the linked article's contents.


Furthermore, there are the Bash man pages, and this SO answer that claims to be based on those pages. According to the answer, the stages of command parsing are as follows:

  1. initial word splitting
  2. brace expansion
  3. tilde expansion
  4. parameter, variable and arithmetic expansion
  5. command substitution
  6. secondary word splitting
  7. path expansion (aka globbing)
  8. quote removal

Emphasis mine.

I am assuming that by "initial word splitting" the author means splitting of the entire line, and by "secondary word splitting" they mean splitting of the results of the expansions. This would entail that there are at least two distinct tokenization passes during command parsing.


Considering the ordering contradictions between the two sources, what is the actual order in which the input command line is de-quoted and split into words/tokens, relative to the other operations being performed?


EDIT NOTE:

Some of the answers refer to a sub-question that an earlier version of this question contained:

Why does cmd='var=foo';$cmd produce bash: var=foo: command not found?
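
For reference, that behavior can be reproduced in an interactive session like this:

    $ cmd='var=foo'
    $ $cmd
    bash: var=foo: command not found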

Upvotes: 1

Views: 2715

Answers (3)

CBlew

Reputation: 721

I agree that my question was asking for a lot, and I deeply appreciate all the valuable input. My gratitude to @rici and @CharlesDuffy.


Below is a rough outline of how Bash interprets and executes code.

Stage 1: Line feed

The shell reads input line by line.

Stage 2: Tokenization

The line is chopped into tokens (words and operators), delimited by metacharacters. Quoting (\, '…', "…") is respected, aliases are substituted, and comments are removed. Token boundaries are recorded internally.

Metacharacters are: <space>, <tab>, <newline>, |, &, ;, (, ), <, >.
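
A quick way to see how quoting affects tokenization (my own sketch of an interactive session):

    $ echo one;echo two      # unquoted ';' is a metacharacter: two separate commands
    one
    two
    $ echo 'one;echo two'    # quoted ';' is literal: one command with one argument
    one;echo two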

Stage 3: Command parsing

The line is parsed for pipelines, lists, and compound commands (loops, conditionals, groupings). This gives Bash an idea of the order in which it will carry out sub-commands. Each sub-command is then processed individually by its own parsing cycle.

Stage 4: Grammar

Assignments (those to the left of the command name) and redirections are removed and saved for later.
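
For example (the file names here are hypothetical), in a line like the following, the assignment and the redirection are set aside at this stage:

    # LC_ALL=C is saved as a temporary assignment, '> out.txt' as a redirection;
    # the words that continue through the remaining stages are just: sort file
    LC_ALL=C sort file > out.txt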

Stage 5: Expansions

Expansions are performed, in order:

  1. Brace expansion, e.g. {1..3}.
  2. Tilde expansion, e.g. ~root.
  3. Parameter & variable expansion, e.g. ${var##*/}.
  4. Arithmetic expansion, e.g. $((1+12)).
  5. Command substitution, e.g. $(date).
  6. Process substitution, where supported, e.g. cat <(ls).
  7. Word splitting, which applies to the unquoted results of the expansions; uses the IFS variable for delimiters.
  8. Filename expansion, or globbing, e.g. ls ?d*.
  9. Quote removal: all unquoted \, ', and " characters that did not result from expansions are removed.
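
One consequence of this ordering, easy to check at the prompt, is that brace expansion happens before parameter and variable expansion, and is not re-run afterwards:

    $ a='1..3'
    $ echo {$a}        # brace expansion already ran before $a was expanded
    {1..3}
    $ echo {1..3}      # a literal range is brace-expanded
    1 2 3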

Stage 6: Redirections

Redirections are performed now, then removed. Previous redirections from pipelines may be overridden.

If the line contains no command name, redirections do not affect the current shell environment; otherwise they affect only said command.

Stage 7: Assignments

Assignments are performed now, then removed. Their values (to the right of =) undergo:

  • tilde expansion,
  • parameter expansion,
  • command substitution,
  • arithmetic expansion,
  • quote removal.

If the line contains no command name, assignments affect the current shell environment; otherwise they exist only in the environment of said command.
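
A small sketch of what the expansion list above implies for assignment values (the home directory path shown is an assumption about the environment):

    $ v=~            # tilde expansion applies to the value
    $ echo "$v"
    /home/user
    $ w=$(echo a b)  # command substitution applies, but the result is not word-split
    $ echo "$w"
    a b
    $ x=*            # no filename expansion on the right-hand side of an assignment
    $ echo "$x"
    *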

Stage 8: Command and arguments

If no command name results at this point, the remaining stages are skipped.

Otherwise, the first word of the line becomes the command name, and the following words become its arguments.

Stage 9: Execution


Now, to answer my question.

As follows from the above:

  1. Tokenization occurs in stage 2; word splitting occurs in stage 5 (assignment values in stage 7 are expanded but not word-split). The two are different concepts.
  2. Quotes (and backslashes) come into play in stage 2, and are generally removed in stage 5. For assignments, they live until stage 7.
  3. Assignments are recognized in stage 4, so they can’t come from variable expansion, which occurs in stage 5.

Upvotes: 1

rici

Reputation: 241861

Posix sets out a precise procedure for shell interpretation. However, most shells -- including bash -- add their own syntax extensions. Also, the standard doesn't insist that its algorithm actually be used; just that the end result is the same. So there are some differences between the standard algorithm and descriptions concerning individual shells. Nonetheless, the broad outline is the same.

It is important to understand the differences between tokenisation and word-splitting. Tokenisation divides the input into syntactically significant tokens, which are then used by the shell grammar to syntactically analyse the input. Syntactic tokens include things like semicolons and parentheses ("operators" in the terminology of the standard). One particular type of token is a WORD.

Tokenisation is, as noted by the standard, basically the first step in parsing the input (but, as noted below, it depends on the identification of quoted characters.)

WORDs may be subsequently interpreted by applying various expansions. The precise set of expansions applied to each word depends on the grammatical context; not all words are treated the same. This is documented in the narrative text of the standard. One transformation which is applied to some WORDs is word-splitting, which splits one WORD into a list of WORDs based on the presence of field-separator characters, by default whitespace (and configurable by changing the value of the IFS shell variable). Word-splitting does not change the syntactic token type; indeed, by the time it happens, syntactic analysis is complete.

Not all WORDs are subject to word-splitting. In particular, word-splitting is not performed unless there was some expansion, and then only if the expansion was not inside double quotes. (And even then, not in all syntactic contexts.)
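
A compact way to see both conditions at the prompt (the example strings are my own):

    $ IFS=,
    $ printf '<%s>\n' a,b       # literal text: no expansion, so no word-splitting
    <a,b>
    $ x=a,b
    $ printf '<%s>\n' $x        # unquoted expansion result: split on IFS
    <a>
    <b>
    $ printf '<%s>\n' "$x"      # double-quoted expansion: not split
    <a,b>
    $ unset IFS                 # restore default field splitting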

The algorithm for dividing the input into tokens must be equivalent to that in the standard. This algorithm requires that it be known which characters have been quoted; most historical implementations do that by internally flagging each input character with a "quoted" bit. Whether or not the quoting characters are removed during tokenisation is somewhat implementation-dependent; the standard puts the quote removal step at the end but an implementation could do it earlier if the end result is identical.

Note that = is not an operator character, so it does not cause var=foo to be split into multiple tokens. However, tokens which start with an identifier followed by = are treated specially by the shell parser; they are later treated as parameter assignments. But, as mentioned above, word-splitting does not change the syntactic nature of a WORD, so WORDs resulting from word-splitting which happen to look like parameter assignments are not treated as such by the shell parser.

Upvotes: 2

Charles Duffy

Reputation: 295650

The very first step in shell parsing is applying shell grammar rules which are obligated to provide a superset of the syntax specified in the POSIX shell command language grammar specification.

It's only in this initial stage that assignments can be detected, and only under very specific circumstances:

  • The ASSIGNMENT_WORD token must be produced by the parser (note that the parser runs only once, and does not rerun after any expansions have taken place!)
  • The = character itself, and the valid variable name preceding it, must not be quoted.
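
A sketch of the second condition (output as I would expect from an interactive bash):

    $ var=foo              # unquoted name and '=': parsed as an assignment
    $ "var"=foo            # quoted name: the parser does not produce ASSIGNMENT_WORD
    bash: var=foo: command not found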

The parser is never rerun on expansion results without an explicit invocation of eval (or passing the results to another shell as code, or taking some comparable explicit action), so the results of an expansion will never generate an assignment if the operation did not parse as an assignment prior to that expansion taking place.
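
This is exactly what the edit note in the question shows. For comparison, an explicit eval does feed the expanded text back through the parser (a deliberate, and potentially risky, step):

    $ cmd='var=foo'
    $ $cmd                 # expansion result is never re-parsed: no assignment happens
    bash: var=foo: command not found
    $ eval "$cmd"          # eval re-runs the parser on the expanded text
    $ echo "$var"
    foo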

Upvotes: 2
