Reputation: 363
NOTE : I'm using C++14 flag to compile... I am trying to create a very simple lexer in C++. I am using regular expressions to identify different tokens . My program is able to identify the tokens and display them. BUT THE out is of the form
int
main
hello
2
*
3
+
return
I want the output to be in the form
int IDENTIFIER
hello IDENTIFIER
* OPERATOR
3 NUMBER
so on...........
I am not able to achieve the above output.
Here is my program:
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <map>
using namespace std;
int main()
{
string str = " hello how are 2 * 3 you? 123 4567867*98";
// define list of token patterns
map<string, string> v
{
{"[0-9]+" , "NUMBERS"} ,
{"[a-z]+" , "IDENTIFIERS"},
{"[\\*|\\+", "OPERATORS"}
};
// build the final regex
string reg = "";
for(auto it = v.begin(); it != v.end(); it++)
reg = reg + it->first + "|";
// remove extra trailing "|" from above instance of reg..
reg.pop_back();
cout << reg << endl;
regex re(reg);
auto words_begin = sregex_iterator(str.begin(), str.end(), re);
auto words_end = sregex_iterator();
for(sregex_iterator i = words_begin; i != words_end; i++)
{
smatch match = *i;
string match_str = match.str();
cout << match_str << "\t" << endl;
}
return 0;
}
what is the most optimal way of doing it and also maintain the order of tokens as they appear in the source program?
Upvotes: 3
Views: 14155
Reputation: 19607
I managed to do this with only one iteration over the parsed string. All you have to do is add parentheses around regex for each token type, then you'll be able to access the strings of these submatches. If you get a non-empty string for a submatch, that means it was matched. You know the index of the submatch and therefore the index in v
.
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <vector>
int main()
{
std::string str = " hello how are 2 * 3 you? 123 4567867*98";
// use std::vector instead, we need to have it in this order
std::vector<std::pair<std::string, std::string>> v
{
{"[0-9]+" , "NUMBERS"} ,
{"[a-z]+" , "IDENTIFIERS"},
{"\\*|\\+", "OPERATORS"}
};
std::string reg;
for(auto const& x : v)
reg += "(" + x.first + ")|"; // parenthesize the submatches
reg.pop_back();
std::cout << reg << std::endl;
std::regex re(reg, std::regex::extended); // std::regex::extended for longest match
auto words_begin = std::sregex_iterator(str.begin(), str.end(), re);
auto words_end = std::sregex_iterator();
for(auto it = words_begin; it != words_end; ++it)
{
size_t index = 0;
for( ; index < it->size(); ++index)
if(!it->str(index + 1).empty()) // determine which submatch was matched
break;
std::cout << it->str() << "\t" << v[index].second << std::endl;
}
return 0;
}
std::regex re(reg, std::regex::extended);
is for matching for the longest string which is necessary for a lexical analyzer. Otherwise it might identify while1213
as while
and number 1213
and depends on the order you define for the regex.
Upvotes: 4
Reputation: 7943
This is a quick and dirty solution iterating on each pattern, and for each pattern trying to match the entire string, then iterating over matches and storing each match with its position in a map. The map implicitly sorts the matches by key (position) for you, so then you just need to iterate the map to get the matches in positional order, regardless of their pattern name.
#include <iterator>
#include <iostream>
#include <string>
#include <regex>
#include <list>
#include <map>
using namespace std;
int main(){
string str = " hello how are 2 * 3 you? 123 4567867*98";
// define list of patterns
map<string,string> patterns {
{ "[0-9]+" , "NUMBERS" },
{ "[a-z]+" , "IDENTIFIERS" },
{ "\\*|\\+", "OPERATORS" }
};
// storage for results
map< size_t, pair<string,string> > matches;
for ( auto pat = patterns.begin(); pat != patterns.end(); ++pat )
{
regex r(pat->first);
auto words_begin = sregex_iterator( str.begin(), str.end(), r );
auto words_end = sregex_iterator();
for ( auto it = words_begin; it != words_end; ++it )
matches[ it->position() ] = make_pair( it->str(), pat->second );
}
for ( auto match = matches.begin(); match != matches.end(); ++match )
cout<< match->second.first << " " << match->second.second << endl;
}
Output:
hello IDENTIFIERS
how IDENTIFIERS
are IDENTIFIERS
2 NUMBERS
* OPERATORS
3 NUMBERS
you IDENTIFIERS
123 NUMBERS
4567867 NUMBERS
* OPERATORS
98 NUMBERS
Upvotes: 3