elizabeth

Reputation: 71

What do Tokens do and why do they need to be created in C++ programming?

I am reading a book (Programming: Principles and Practice Using C++ by Bjarne Stroustrup).

In it, he introduces Tokens:

“A token is a sequence of characters that represents something we consider a unit, such as a number or an operator. That’s the way a C++ compiler deals with its source. Actually, “tokenizing” in some form or another is the way most analysis of text starts.”

class Token {
public:
    char kind;
    double value;
};

I do get what they are, but he never explains them in detail, and it's quite confusing to me.

Upvotes: 6

Views: 3098

Answers (4)

farhan

Reputation: 469

As mentioned by others, Bjarne is referring to lexical analysis.

In general terms, tokenizing (creating tokens) is the process of reading an input stream and dividing it into blocks, without worrying about whitespace etc., as described earlier by @StoryTeller, or as Bjarne put it: "a sequence of characters that represents something we consider a unit".

The Token class itself is an example of a C++ user-defined type (UDT). Just like the built-in types int or char, it can be used to define variables and hold values.

A UDT can have member functions as well as data members. In your code you define two data members, which is very basic:

1) kind, 2) value

class Token {
public:
   char kind;
   double value;
};

Based on it, we can initialize or construct objects of that type.

Token token_kind_one{'+'};

This initializes token_kind_one with its kind (an operator), '+'.

Token token_kind_two{'8',3.14};

and token_kind_two with its kind (a number), '8', and with a value of 3.14.
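
If the class is later given constructors (the book does this in a later refinement of Token), the same two initializations become explicit constructor calls. A rough sketch:

class Token {
public:
    char kind;      // what kind of token: '8' marks a number, '+', '*', ... mark operators
    double value;   // for number tokens: the numeric value

    Token(char ch) : kind{ch}, value{0} { }                // operator token, no value needed
    Token(char ch, double val) : kind{ch}, value{val} { }  // number token: kind plus its value
};

int main()
{
    Token token_kind_one{'+'};        // calls Token(char)
    Token token_kind_two{'8', 3.14};  // calls Token(char, double)
}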

Let's assume we have an expression of ten characters, 1+2*3(5/4), which translates to ten tokens.

Tokens:

        |----|----|----|----|----|----|----|----|----|----|
Kind    |'8' |'+' |'8' |'*' |'8' |'(' |'8' |'/' |'8' |')' |
        |----|----|----|----|----|----|----|----|----|----|
Value   | 1  |    | 2  |    | 3  |    | 5  |    | 4  |    |
        |----|----|----|----|----|----|----|----|----|----|

The C++ compiler transforms the file's data into a token sequence, skipping all whitespace, to make the text understandable to itself.
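
To make that concrete, here is a minimal sketch of such a tokenizer (my own illustration, not the book's or a real compiler's code) that turns the string "1+2*3(5/4)" into the Token sequence shown in the table, skipping any whitespace along the way:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

class Token {
public:
    char kind;     // '8' marks a number; otherwise the operator/parenthesis character itself
    double value;  // meaningful only for number tokens
};

// Turn a string of characters into a sequence of Tokens, skipping whitespace.
std::vector<Token> tokenize(const std::string& s)
{
    std::vector<Token> tokens;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (std::isspace(static_cast<unsigned char>(s[i]))) continue;   // whitespace is skipped
        if (std::isdigit(static_cast<unsigned char>(s[i]))) {
            std::size_t len = 0;
            double val = std::stod(s.substr(i), &len);   // read the whole number
            tokens.push_back(Token{'8', val});
            i += len - 1;
        } else {
            tokens.push_back(Token{s[i], 0});            // operator or parenthesis
        }
    }
    return tokens;
}

int main()
{
    // Prints ten kind/value pairs matching the table above.
    for (const Token& t : tokenize("1+2*3(5/4)"))
        std::cout << "kind: " << t.kind << "   value: " << t.value << '\n';
}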

Upvotes: 3

Tokenizing is important to the process of figuring out what a program does. What Bjarne is referring to in relation to C++ source deals with how a program's meaning is affected by the tokenization rules. In particular, we must know what the tokens are and how they are determined. Specifically, how can we identify a single token when it appears next to other characters, and how should we delimit tokens if there is ambiguity?

For instance, consider the prefix operators ++ and +. Let's assume we only had one token + to work with. What is the meaning of the following snippet?

int i = 1;
++i;

With + only, is the above going to just apply unary + on i twice? Or is it going to increment it once? It's ambiguous, naturally. We need an additional token, and therefore introduce ++ as its own "word" in the language.

But now there is another (albeit smaller) problem. What if the programmer wants to just apply unary + twice, and not increment? Token processing rules are needed. So if we determine that whitespace is always a separator for tokens, our programmer may write:

int i = 1;
+ +i;

Roughly speaking, a C++ implementation starts with a file full of characters, transforms them initially to a sequence of tokens ("words" with meaning in the C++ language), and then checks if the tokens appear in a "sentence" that has some valid meaning.
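
One common way tokenizers resolve such ambiguity is the "maximal munch" (longest match) rule: always take the longest token that matches. A minimal sketch of that idea (my own illustration, not any compiler's actual code):

#include <iostream>
#include <string>
#include <vector>

// Split s into tokens, always preferring the longest operator that matches
// ("maximal munch"): "++i" becomes {"++", "i"}, while "+ +i" becomes {"+", "+", "i"}.
std::vector<std::string> tokenize_ops(const std::string& s)
{
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] == ' ') { ++i; continue; }                                  // whitespace separates tokens
        if (s.compare(i, 2, "++") == 0) { tokens.push_back("++"); i += 2; }  // try the longer token first
        else if (s[i] == '+')           { tokens.push_back("+");  i += 1; }
        else { tokens.push_back(std::string(1, s[i])); ++i; }                // anything else: one-char token
    }
    return tokens;
}

int main()
{
    for (const std::string& t : tokenize_ops("++i")) std::cout << t << ' ';   // prints: ++ i
    std::cout << '\n';
    for (const std::string& t : tokenize_ops("+ +i")) std::cout << t << ' ';  // prints: + + i
    std::cout << '\n';
}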

Upvotes: 7

freakish

Reputation: 56587

He's referring to lexical analysis - a necessary piece of every compiler. It is a tool for the compiler to treat text (as in: a sequence of bytes) in a meaningful way. For example, consider the following line in C++:

double x  = (15*3.0);  // my variable

when the compiler looks at the text it first splits the line into a sequence of tokens which may look like this:

Token {"identifier", "double"}
Token {"space", " "}
Token {"identifier", "x"}
Token {"space", "  "}
Token {"operator", "="}
Token {"space", " "}
Token {"separator", "("}
Token {"literal_integer", "15"}
Token {"operator", "*"}
Token {"literal_float", "3.0"}
Token {"separator", ")"}
Token {"separator", ";"}
Token {"space", "  "}
Token {"comment", "// my variable"}
Token {"end_of_line"}

It doesn't have to be interpreted like the above (note that in my case both kind and value are strings); it's just an example of how it can be done. You usually do this via some regular expressions.
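
For instance, a toy regular-expression tokenizer might look roughly like this (my own sketch using std::regex; the token categories are simplified, and real compilers use hand-written or generated lexers rather than std::regex):

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // One alternative per token kind, tried left to right; whitespace is matched but ignored.
    std::regex token_re(R"(([A-Za-z_]\w*)|(\d+\.\d+)|(\d+)|([-+*/=])|([();])|(\s+))");
    std::string line = "double x  = (15*3.0);";

    for (auto it = std::sregex_iterator(line.begin(), line.end(), token_re);
         it != std::sregex_iterator(); ++it) {
        const std::smatch& m = *it;
        if      (m[1].matched) std::cout << "identifier:      " << m.str() << '\n';
        else if (m[2].matched) std::cout << "literal_float:   " << m.str() << '\n';
        else if (m[3].matched) std::cout << "literal_integer: " << m.str() << '\n';
        else if (m[4].matched) std::cout << "operator:        " << m.str() << '\n';
        else if (m[5].matched) std::cout << "separator:       " << m.str() << '\n';
        // group 6 is whitespace; a real lexer might keep or drop it
    }
}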

Anyway, tokens are easier for the machine to understand than raw text. The next step for the compiler is to build a so-called abstract syntax tree based on the tokenization, and finally add meaning to everything.

Also note that unless you are writing a parser, it is unlikely you will ever use the concept.

Upvotes: 6

SCCC

Reputation: 351

Broadly speaking, a compiler will run multiple operations on given source code before converting it into a binary format. One of the first stages is running a tokenizer, where the contents of a source file are converted to Tokens, which are units understood by the compiler. For example, if you write a statement int a, the tokenizer might create a structure to store this information.

Type: integer
Identifier: a
Reserved Word: No
Line number: 10

This would be then referred to as a token, and most of the code in a source file will be broken down into similar structures.
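
A rough C++ sketch of such a record (the struct and field names here are purely illustrative, mirroring the list above):

#include <string>

// Hypothetical record a tokenizer might fill in for the declaration "int a";
// the members simply mirror the list above.
struct TokenInfo {
    std::string type;        // e.g. "integer"
    std::string identifier;  // e.g. "a"
    bool reserved_word;      // e.g. false: "a" is not a keyword
    int line_number;         // e.g. 10
};

int main()
{
    TokenInfo a_token{"integer", "a", false, 10};
    (void)a_token;  // the sketch only shows the shape of the record
}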

Upvotes: -1
