Reputation: 363

Writing a very simple lexical analyser in C++

NOTE : I'm using C++14 flag to compile... I am trying to create a very simple lexer in C++. I am using regular expressions to identify different tokens . My program is able to identify the tokens and display them. BUT THE out is of the form

int 
main
hello 
2
*
3
+
return

I want the output to be in the form

int IDENTIFIER
hello IDENTIFIER
* OPERATOR 
3 NUMBER 
so on...........

I am not able to achieve the above output.

Here is my program:

#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <map>

using namespace std;

int main()
{
    string str = " hello how are 2 * 3 you? 123 4567867*98";

    // define list of token patterns
    map<string, string> v
    {
        {"[0-9]+" ,  "NUMBERS"} ,
        {"[a-z]+" ,  "IDENTIFIERS"},
        {"[\\*|\\+",   "OPERATORS"}
    };

    // build the final regex
    string reg = "";
    for(auto it = v.begin(); it != v.end(); it++)
        reg = reg + it->first + "|";

    // remove extra trailing "|" from above instance of reg..
    reg.pop_back();
    cout << reg << endl;

    regex re(reg);
    auto words_begin = sregex_iterator(str.begin(), str.end(), re);
    auto words_end = sregex_iterator();

    for(sregex_iterator i = words_begin; i != words_end; i++)
    {
        smatch match = *i;
        string match_str = match.str();
        cout << match_str << "\t" << endl;
    }

    return 0;
}

what is the most optimal way of doing it and also maintain the order of tokens as they appear in the source program?

Upvotes: 3

Answers (2)

LogicStuff

Reputation: 19607

I managed to do this with only one iteration over the parsed string. All you have to do is add parentheses around regex for each token type, then you'll be able to access the strings of these submatches. If you get a non-empty string for a submatch, that means it was matched. You know the index of the submatch and therefore the index in v.

#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <vector>

int main()
{
    std::string str = " hello how are 2 * 3 you? 123 4567867*98";

    // use std::vector instead, we need to have it in this order
    std::vector<std::pair<std::string, std::string>> v
    {
        {"[0-9]+" , "NUMBERS"} ,
        {"[a-z]+" , "IDENTIFIERS"},
        {"\\*|\\+", "OPERATORS"}
    };

    std::string reg;

    for(auto const& x : v)
        reg += "(" + x.first + ")|"; // parenthesize the submatches

    reg.pop_back();
    std::cout << reg << std::endl;

    std::regex re(reg, std::regex::extended); // std::regex::extended for longest match

    auto words_begin = std::sregex_iterator(str.begin(), str.end(), re);
    auto words_end = std::sregex_iterator();

    for(auto it = words_begin; it != words_end; ++it)
    {
        size_t index = 0;

        for( ; index < it->size(); ++index)
            if(!it->str(index + 1).empty()) // determine which submatch was matched
                break;

        std::cout << it->str() << "\t" << v[index].second << std::endl;
    }

    return 0;
}

std::regex re(reg, std::regex::extended); is for matching for the longest string which is necessary for a lexical analyzer. Otherwise it might identify while1213 as while and number 1213 and depends on the order you define for the regex.

Upvotes: 4

Jonathan H

Reputation: 7943

This is a quick and dirty solution iterating on each pattern, and for each pattern trying to match the entire string, then iterating over matches and storing each match with its position in a map. The map implicitly sorts the matches by key (position) for you, so then you just need to iterate the map to get the matches in positional order, regardless of their pattern name.

#include <iterator>
#include <iostream>
#include <string>
#include <regex>
#include <list>
#include <map>

using namespace std;

int main(){

    string str = " hello how are 2 * 3 you? 123 4567867*98";

    // define list of patterns
    map<string,string> patterns {
        { "[0-9]+" ,   "NUMBERS" },
        { "[a-z]+" ,   "IDENTIFIERS" },
        { "\\*|\\+",  "OPERATORS" }
    };

    // storage for results
    map< size_t, pair<string,string> > matches;

    for ( auto pat = patterns.begin(); pat != patterns.end(); ++pat )
    {
        regex r(pat->first);
        auto words_begin = sregex_iterator( str.begin(), str.end(), r );
        auto words_end   = sregex_iterator();

        for ( auto it = words_begin; it != words_end; ++it )
            matches[ it->position() ] = make_pair( it->str(), pat->second );
    }

    for ( auto match = matches.begin(); match != matches.end(); ++match )
        cout<< match->second.first << " " << match->second.second << endl;
}

Output:

hello IDENTIFIERS
how IDENTIFIERS
are IDENTIFIERS
2 NUMBERS
* OPERATORS
3 NUMBERS
you IDENTIFIERS
123 NUMBERS
4567867 NUMBERS
* OPERATORS
98 NUMBERS

Upvotes: 3

Writing a very simple lexical analyser in C++

Answers (2)

Related Questions