Reputation: 721
I am trying to definitively understand the Bash parser's order of business.
This wiki page claims the following order:
- Read line.
- Process/remove quotes.
- Split on semicolons.
- Process "special operators", which according to the article are:
  - Command groupings and brace expansions, e.g. `{…}`
  - Process substitutions, e.g. `cmd1 <(cmd2)`
  - Redirections.
- Pipelines.
- Perform expansions, which are not all listed, but should include:
  - Brace expansion, e.g. `{1..3}`. For some reason the article tucks this into the previous stage.
  - Tilde expansion, e.g. `~root`
  - Parameter & variable expansion, e.g. `${var##*/}`
  - Arithmetic expansion, e.g. `$((1+12))`
  - Command substitution, e.g. `$(date)`
  - Word splitting, which applies to the results of the expansions; uses `$IFS`
  - Pathname expansion, or globbing, e.g. `ls ?d*`
- Word splitting, which applies to the whole line; does not use `$IFS`
- Execution.
This is not a quote, but paraphrased contents of the linked article.
Furthermore, there are the Bash man pages, and this SO answer claiming to be based on those pages. According to the answer, the stages of command parsing are as follows:
- **initial word splitting**
- brace expansion
- tilde expansion
- parameter, variable and arithmetic expansion
- command substitution
- **secondary word splitting**
- path expansion (aka globbing)
- quote removal
Emphasis mine.
I am assuming that by "initial word splitting" the author means splitting of the entire line, and by "secondary word splitting" they mean splitting of the results of the expansions. This would entail that there are at least two distinct tokenization passes during command parsing.
Considering the ordering contradictions between the two sources: what is the actual order in which the input command line is de-quoted and split into words/tokens, relative to the other operations being performed?
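To make the question concrete, here is a small experiment (plain bash, specific to neither source) in which the relative order of expansion and word splitting is directly observable:

```shell
#!/usr/bin/env bash
var='one two'
printf '<%s>' $var;   echo   # <one><two> : the expansion result was word-split
printf '<%s>' "$var"; echo   # <one two>  : quoting suppressed the splitting
```

Whatever the exact pipeline, splitting evidently happens after `$var` is expanded, and quoting decides whether it happens at all.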
EDIT NOTE:
To explain part of the answers, an earlier version of this question had a sub-question: why does `cmd='var=foo'; $cmd` produce `bash: var=foo: command not found`?
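For reference, the behaviour can be reproduced directly:

```shell
#!/usr/bin/env bash
cmd='var=foo'
$cmd                      # bash: var=foo: command not found (exit status 127)
echo "var=${var-unset}"   # var=unset : no assignment took place
```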
Upvotes: 1
Views: 2715
Reputation: 721
I agree that my question was asking a lot, and I deeply appreciate all the valuable input. My gratitude to @rici and @CharlesDuffy.
Below is a rough outline of how Bash interprets and executes code.
- The shell reads input in terms of lines.
- The line is chopped into tokens (words and operators) delimited by metacharacters. Quoting (`\`, `'…'`, `"…"`) is respected, aliases are substituted, and comments are removed. Token boundaries are recorded internally. Metacharacters are: `<space>`, `<tab>`, `<newline>`, `|`, `&`, `;`, `(`, `)`, `<`, and `>`.
- The line is parsed for pipelines, lists, and compound commands (loops, conditionals, groupings). This gives Bash the ordering in which it will carry out sub-commands. Each sub-command is then processed individually by its own parsing cycle.
- Assignments (those to the left of the command name) and redirections are removed and saved for later.
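The "saved for later" part is observable: because redirections are collected during parsing rather than executed in place, they may appear anywhere in a simple command. A small sketch (assuming a writable temporary directory):

```shell
#!/usr/bin/env bash
tmp=$(mktemp)
>"$tmp" echo hello   # same command as: echo hello >"$tmp"
cat "$tmp"           # hello
rm -f "$tmp"
```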
- Expansions are performed, in order:
  - brace expansion: `{1..3}`
  - tilde expansion: `~root`
  - parameter & variable expansion: `${var##*/}`
  - arithmetic expansion: `$((1+12))`
  - command substitution: `$(date)`
  - process substitution: `cat <(ls)`
  - word splitting, using the `IFS` variable for delimiters
  - pathname expansion (globbing): `ls ?d*`
- Quote characters (`\`, `'`, and `"`) not resulting from expansions are purged.
- Redirections are performed now, then removed. Previous redirections from pipelines may be overridden.
If the line contains no command name, the redirections are still performed (files are opened, created, or truncated) but affect no command; otherwise they affect only said command.
Assignments are performed now, then removed. Their values (to the right of `=`) undergo tilde expansion, parameter expansion, command substitution, arithmetic expansion, and quote removal.
If the line contains no command name, assignments affect the current shell environment; otherwise they exist only for said command.
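The scoping rule for prefix assignments is easy to verify:

```shell
#!/usr/bin/env bash
var=outer                         # no command name: affects the current shell
var=inner bash -c 'echo "$var"'   # inner : assignment exists only for this command
echo "$var"                       # outer : the current shell is untouched
```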
At this point, if no command name results, the command exits.
Otherwise, the first word of the line becomes the command name, and the following words become its arguments.
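One consequence of the expansion ordering in the outline above: brace expansion runs before parameter expansion, so a variable cannot be used inside a brace range:

```shell
#!/usr/bin/env bash
n=3
echo {1..3}    # 1 2 3
echo {1..$n}   # {1..3} : brace expansion ran first, saw no valid numeric range,
               #          and only afterwards was $n expanded to 3
```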
Now, to answer my question. As follows from the above: the line is split into tokens exactly once, during the initial tokenization, and the "word splitting" listed among the expansions applies only to the results of those expansions; the parser is never rerun on expanded text.
Upvotes: 1
Reputation: 241861
POSIX sets out a precise procedure for shell interpretation. However, most shells -- including bash -- add their own syntax extensions. Also, the standard doesn't insist that its algorithm actually be used, just that the end result is the same. So there are some differences between the standard algorithm and descriptions concerning individual shells. Nonetheless, the broad outline is the same.
It is important to understand the difference between tokenisation and word-splitting. Tokenisation divides the input into syntactically significant tokens, which are then used by the shell grammar to syntactically analyse the input. Syntactic tokens include things like semicolons and parentheses ("operators" in the terminology of the standard). One particular type of token is a WORD.
Tokenisation is, as noted by the standard, basically the first step in parsing the input (but, as noted below, it depends on the identification of quoted characters.)
WORDs may be subsequently interpreted by applying various expansions. The precise set of expansions applied to each word depends on the grammatical context; not all words are treated the same. This is documented in the narrative text of the standard. One transformation which is applied to some WORDs is word-splitting, which splits one WORD into a list of WORDs based on the presence of field-separator characters, by default whitespace (and configurable by changing the value of the `IFS` shell variable). Word-splitting does not change the syntactic token type; indeed, by the time it happens, syntactic analysis is complete.
Not all WORDs are subject to word-splitting. In particular, word-splitting is not performed unless there was some expansion, and then only if the expansion was not inside double quotes. (And even then, not in all syntactic contexts.)
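This distinction (splitting applies to expansion results only, never to literal text) can be demonstrated with a custom `IFS`:

```shell
#!/usr/bin/env bash
IFS=,
s='a,b,c'
printf '<%s>' $s;    echo   # <a><b><c> : unquoted expansion result is split on IFS
printf '<%s>' "$s";  echo   # <a,b,c>   : quoted expansion is not split
printf '<%s>' a,b,c; echo   # <a,b,c>   : literal word, no expansion, so no splitting
```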
The algorithm for dividing the input into tokens must be equivalent to that in the standard. This algorithm requires that it be known which characters have been quoted; most historical implementations do that by internally flagging each input character with a "quoted" bit. Whether or not the quoting characters are removed during tokenisation is somewhat implementation-dependent; the standard puts the quote removal step at the end but an implementation could do it earlier if the end result is identical.
Note that `=` is not an operator character, so it does not cause `var=foo` to be split into multiple tokens. However, tokens which start with an identifier followed by `=` are treated specially by the shell parser; they are later treated as parameter assignments. But, as mentioned above, word-splitting does not change the syntactic nature of a WORD, so WORDs resulting from word-splitting which happen to look like parameter assignments are not treated as such by the shell parser.
Upvotes: 2
Reputation: 295650
The very first step in shell parsing is applying shell grammar rules which are obligated to provide a superset of the syntax specified in the POSIX shell command language grammar specification.
It's only in this initial stage where assignments can be detected, and only under very specific circumstances:
- An `ASSIGNMENT_WORD` token must be produced by the parser (note that the parser runs only once, and does not rerun after any expansions have taken place!).
- The `=` character itself, and the valid variable name preceding it, must not be quoted.

The parser is never rerun on expansion results without an explicit invocation of `eval` (or passing the results to another shell as code, or taking some comparable explicit action), so the results of an expansion will never generate an assignment if the operation did not parse as an assignment prior to that expansion taking place.
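This is exactly why `eval` changes the outcome: it feeds the expansion result back through the parser, which can then produce the `ASSIGNMENT_WORD` token:

```shell
#!/usr/bin/env bash
cmd='var=foo'
$cmd 2>/dev/null || echo "not parsed as an assignment"
eval "$cmd"      # re-parsed: now it IS an assignment
echo "$var"      # foo
```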
Upvotes: 2