Reputation: 3341
I have been assigned to write a compiler for the BASIC programming language. In BASIC, statements are separated by new lines or by a : mark, e.g. the following two code samples are both valid.
Model# 1
10 PRINT "Hello World 1" : PRINT "Hello World 2"
Model# 2
10 PRINT "Hello World 1"
20 PRINT "Hello World 2"
You can test those here.
The first thing I need to do, before parsing the code in my compiler, is to split it into statements.
I have already split the code into lines, but I am stuck finding a regex to split the following code sample:
This code sample should be split into 2 PRINT statements.
10 PRINT "Hello World 1" : PRINT "Hello World 2"
But it should NOT match this:
The following code sample is a single standalone command.
10 PRINT "Hello World 1" ": PRINT Hello World 2"
Is there a regex pattern that matches the : in the first code sample above, where it is outside a pair of " marks, and does NOT match the : in the second one?
Can anybody help me out here?
Anything would help. :)
Upvotes: 4
Views: 430
Reputation: 3341
Thanks to @Mauren I managed to do what I wanted to do.
Here is my code (maybe it will help someone later):
Note that the source file's content is held in char* buffer and the resulting lines are stored in vector<string> source_codes.
/* buffer for the line currently being collected */
std::string token;
/* Tokenize the file's content into separate lines */
/* and store each line (or statement) in the container vector. */
/* Note: a final line without a trailing '\n' is not flushed by this loop. */
const size_t length = strlen(buffer);
for (size_t top = 0, bottom = 0; top < length; top++)
{
    /* keep collecting characters until we hit a line break */
    if (buffer[top] != '\n' || top == bottom)
    {
        /* collect the current line's characters */
        token += buffer[top];
        /* continue seeking */
        continue;
    }
    /* if we reach here we have collected the current line */
    /* normalize the current line */
    boost::algorithm::trim(token);
    /* multiple-statement check point */
    if (token.find(':') != std::string::npos)
    {
        /* are we currently inside a quoted string? */
        bool quotation_meet = false;
        /* process the entire line from the beginning */
        for (size_t index = 0; index < token.length(); index++)
        {
            /* fetch the character currently being processed */
            char _char = token[index];
            /* if we encounter a quotation mark */
            /* we are moving into (or out of) a string */
            /* note that in BASIC, printing a quotation mark requires `CHR$(34)`
             * so there is no `\"` to worry about! :) */
            if (_char == '"')
            {
                /* toggle the quotation flag */
                quotation_meet = !quotation_meet;
                /* proceed with the other chars */
                continue;
            }
            /* if we have met the `:` char and we are not inside a pair of quotation marks */
            if (_char == ':' && !quotation_meet)
            {
                /* everything before the `:` is the first sub-token of the current token */
                /* (the second argument of `substr` is a length, so `index`, not `index - 1`) */
                std::string subtoken(token.substr(0, index));
                /* normalize the sub-token */
                boost::algorithm::trim(subtoken);
                /* add the sub-token as a new line */
                source_codes.push_back(subtoken);
                /* keep the rest of the line as the new token */
                /**
                 * Note: We keep the `:` mark intentionally, since every code line in BASIC
                 * should start with a number; a line starting with `:` is therefore known
                 * to belong to the previous numbered statement.
                 * So we use `token.substr(index)` instead of `token.substr(index + 1)`.
                 */
                token = token.substr(index);
                /* normalize the remaining token */
                boost::algorithm::trim(token);
                /* reset the index for the new token; the loop's `index++` */
                /* then skips the leading `:` we kept */
                index = 0;
                /* continue with the other chars */
                continue;
            }
        }
    }
    /* if the (remaining) token is not empty */
    if (token.length())
        /* add it to our source code */
        source_codes.push_back(token);
    /* move the pointer to the next line's position */
    bottom = top + 1;
    /* clear the token buffer */
    token.clear();
}
/* We NOW have our source code split into lines and saved in a vector */
Upvotes: 0
Reputation: 1975
I believe the best option for you is to tokenize your source code with a loop, instead of trying to tokenize it with regexps.
In pseudocode:
string lexeme;
token t;
for char in string
    if char fits current token
        lexeme = lexeme + char;
    else
        t.lexeme = lexeme;
        t.type = type;
        lexeme = null;
    end if
    // other treatments here
end for
You can see a real-world implementation of this device in this source code, more specifically at line 86.
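Applied to the question's specific problem, the loop idea could look roughly like the sketch below. It is only an illustration, not the code from the linked source: the names split_statements and trim are made up here, and it assumes (as the asker notes for BASIC) that a string literal can never contain an escaped quotation mark.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: strips leading and trailing whitespace.
std::string trim(const std::string& s)
{
    const char* ws = " \t\r\n";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: splits one BASIC line into statements at every ':'
// that lies outside a pair of double quotes.
std::vector<std::string> split_statements(const std::string& line)
{
    std::vector<std::string> statements;
    std::string current;
    bool in_string = false;              // true while between two '"' marks

    for (char c : line) {
        if (c == '"') {
            in_string = !in_string;      // entering or leaving a string literal
        }
        if (c == ':' && !in_string) {
            statements.push_back(trim(current));  // statement separator found
            current.clear();
        } else {
            current += c;                // ordinary character, keep collecting
        }
    }
    statements.push_back(trim(current)); // whatever follows the last separator
    return statements;
}
```

With the two lines from the question, the first should come back as two statements and the second as a single one, because its : sits between quotation marks.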
Upvotes: 1
Reputation: 89574
The idea to avoid this kind of problem is to match content inside quotes before trying to match colons, for example:
"(?>[^\\"]++|\\{2}|\\.)*"|:
You can add capturing groups to know which part of the alternation has been matched.
However, the right tool for this kind of task is probably lex/yacc.
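For a C++ compiler project, here is a rough sketch of the same idea with std::regex. Two assumptions to be aware of: std::regex (ECMAScript grammar) does not support the atomic group and possessive quantifier used in the pattern above, and BASIC strings have no backslash escapes according to the asker, so the pattern is simplified to "[^"]*"|(:) here. The function name split_with_regex is made up for this example.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits a line at every ':' that the regex matches
// through its second alternative, i.e. a colon outside a quoted string.
std::vector<std::string> split_with_regex(const std::string& line)
{
    // Alternative 1 swallows a whole quoted string, so a colon inside it is
    // never seen on its own. Alternative 2 captures a bare colon in group 1.
    static const std::regex pattern("\"[^\"]*\"|(:)");

    std::vector<std::string> statements;
    std::size_t start = 0;               // where the current statement begins

    for (std::sregex_iterator it(line.begin(), line.end(), pattern), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        if (m[1].matched) {              // group 1 set: this match is a separator
            const std::size_t pos = static_cast<std::size_t>(m.position(0));
            statements.push_back(line.substr(start, pos - start));
            start = pos + 1;             // skip the colon itself
        }
    }
    statements.push_back(line.substr(start));  // text after the last separator
    return statements;
}
```

The capturing group is what tells the two alternatives apart: a match with group 1 empty was a quoted string and is simply skipped. The pieces are returned untrimmed, so surrounding spaces are still present.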
Upvotes: 0