Reputation: 2177
What characters are allowed in Common Lisp symbols? Can you give a regular expression to match them (or are they beyond the capability of regular grammars to describe)?
I have tried looking for information on this, but all I can find are some examples in the CLHS, with no concrete definition of what exactly a legal symbol is.
So, Common Lisp symbols can legally contain any character.
However, the parser doesn't just accept any character as it reads Lisp code. What are the rules for parsable symbols? E.g. symbols that can be supplied as 'quoted symbols or inside of '(quoted lists).
I am interested in generating and reading non-bar-delimited symbols from a non-Lisp language. It should suffice, for my application, to use [a-zA-Z0-9:&-]+, but I tend to prefer to be as accurate as possible, which is why I am trying to determine if there is a regex that can match symbols. Matching the |delimited syntax| would be a bonus, but non-delimited symbols would suffice.
These need to be symbols that would be loaded legally when using (read). The answer is not that symbols can contain any character:
[1]> (read t)
#
*** - READ from #<IO TERMINAL-STREAM>: objects printed as # in view of *PRINT-LEVEL* cannot be read back in
I want to know the rules, or a regex, for what is a valid symbol here, without delimiting it with |.
Upvotes: 2
Views: 2090
Reputation: 85853
As sds mentioned, symbol names can contain any characters. Given any string, you can create a symbol with that name. However, based on your comments, it sounds like you're wondering what, under fairly default settings, will be read as a symbol. The answer is still "pretty much anything", with a few exceptions.
The relevant sections in the HyperSpec begin with 2.2 Reader Algorithm, which describes the tokenization process. It describes the process in detail, but perhaps the most important part is:
When dealing with tokens, the reader's basic function is to distinguish representations of symbols from those of numbers. When a token is accumulated, it is assumed to represent a number if it satisfies the syntax for numbers listed in Figure 2-9. If it does not represent a number, it is then assumed to be a potential number if it satisfies the rules governing the syntax for a potential number. If a valid token is neither a representation of a number nor a potential number, it represents a symbol.
The Figure 2-9 mentioned in that excerpt is in section 2.3.1 Numbers as Tokens, which says:
When a token is read, it is interpreted as a number or symbol. The token is interpreted as a number if it satisfies the syntax for numbers specified in the next figure.
So, the process is really "tokenize the stream, and for each token, check if it's a number, and if it's not a number, then it's a symbol." I realize this doesn't provide a nice clean grammar for symbols, but that's just the way the language is defined. If you sit down to the task of writing a tokenizer and reader for a Lisp, you may find that this is a pretty convenient way of going about it. You pretty much just need to recognize which characters terminate a token, which characters start and end lists, what gets eliminated as whitespace, and what your escape characters are. Then you read nested lists of tokens, turning each token into a number or a symbol (or a string, etc.).
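Since the question is about doing this from a non-Lisp language, here is a rough Python sketch of that "tokenize, then rule out numbers" approach. It assumes the standard readtable and *read-base* 10, and it deliberately rejects escapes (\ and |) and package markers (:) rather than handling them. It also ignores the "potential number" rules, so tokens such as 1.5e may be classified differently by a real reader. The names TOKEN, INTEGER, RATIO, FLOAT and classify are made up for this example.

```python
import re

# Assumptions: standard readtable, *read-base* 10, no escapes (\ or |) and
# no package markers (:). All names here are illustrative only.

# Whitespace and the terminating macro characters " ' ( ) , ; ` end a token.
# '#' is a non-terminating macro character: it cannot start a token, but it
# may appear after the first character.
TOKEN = re.compile(r"""[^\s"'(),;`|\\#:][^\s"'(),;`|\\:]*""")

# Number syntax, following the shape of CLHS Figure 2-9 for base 10.
INTEGER = re.compile(r"[+-]?[0-9]+\.?")
RATIO = re.compile(r"[+-]?[0-9]+/[0-9]+")
FLOAT = re.compile(
    r"[+-]?[0-9]*\.[0-9]+([esfdl][+-]?[0-9]+)?"
    r"|[+-]?[0-9]+(\.[0-9]*)?[esfdl][+-]?[0-9]+",
    re.IGNORECASE,
)


def classify(token):
    """Return 'number', 'symbol' or 'invalid' for an unescaped token."""
    if not TOKEN.fullmatch(token):
        return "invalid"
    if set(token) == {"."}:
        return "invalid"  # a token made only of dots is not a symbol
    if any(p.fullmatch(token) for p in (INTEGER, RATIO, FLOAT)):
        return "number"
    return "symbol"
```

With these assumptions, classify("1+") and classify("a#b") come out as symbols, classify("-1/2") and classify("1.5e3") as numbers, and classify("#") and classify("...") as invalid.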
Perhaps one of the easiest ways to see why you have to do this in terms of tokenization and then checking for numbers is the fact that Common Lisp has a *read-base* variable that controls the radix in which integers and ratios are read. Depending on the value of *read-base*, the same token may be a number or a symbol, and you can't know which until you know what the complete token is and what the current state of the runtime is.
CL-USER> 'beef
BEEF
CL-USER> (setf *read-base* 16)
16
CL-USER> 'beef
48879
CL-USER> (setf *read-base* a) ; set it back to 10, which is now a
10
CL-USER> (setf *read-base* 36)
36
CL-USER> 'hello ; a number
29234652
CL-USER> 'hello\ world ; a symbol
|HELLO WORLD|
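The same dependence on the radix can be mimicked outside Lisp. The following is only a sketch: parse_integer and DIGITS are hypothetical names, and it covers plain integers only (no ratios, floats, or the trailing decimal point that forces base 10).

```python
import string

# Digit characters for radixes up to 36: 0-9 followed by a-z.
DIGITS = string.digits + string.ascii_lowercase


def parse_integer(token, base):
    """Return the value of token read as an integer in base, else None.

    Hypothetical helper; None means "not an integer in this base", in
    which case a reader would go on to treat the token as a symbol.
    """
    body = token[1:] if token[:1] in ("+", "-") else token
    if not body:
        return None
    value = 0
    for ch in body.lower():
        digit = DIGITS.find(ch)
        if digit < 0 or digit >= base:
            return None  # character is not a digit in this radix
        value = value * base + digit
    return -value if token[0] == "-" else value
```

Here parse_integer("beef", 10) is None, while parse_integer("beef", 16) is 48879 and parse_integer("hello", 36) is 29234652, matching the transcript above.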
Upvotes: 4
Reputation: 60014
Any character can be in a symbol. E.g.:
(length (loop for i to char-code-limit
              collect (intern (string (code-char i)))))
==> 1114113
Upvotes: 1