I have strings in the following format:

AUTHOR, "TITLE" (PAGES pp.) [CODE STATUS]

For example, given the string

P.G. Wodehouse, "Heavy Weather" (336 pp.) [PH.409 AVAILABLE FOR LENDING]

I want to extract

AUTHOR = P.G. Wodehouse
TITLE = Heavy Weather
PAGES = 336
CODE = PH.409
STATUS = AVAILABLE FOR LENDING

I only know how to do this in Python. Is there an efficient way to do the same thing in C++?

How to extract the string pattern in C++ efficiently?

Answers (4)

Reputation: 153929

Exactly the same way as in Python. C++11 has regular expressions (and for earlier C++, there's Boost regex.) As for the read loop:

std::string line;
while ( std::getline( file, line ) ) {
    //  ...
}

is almost exactly the same as:

for line in file:
    #    ...

The only differences are:

The C++ version will not put the trailing '\n' in the buffer. (In general, the C++ version may be less flexible with regard to end-of-line handling.)
In case of a read error, the C++ version will break the loop; the Python version will raise an exception.

Neither should be an issue in your case.
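To make the end-of-line remark concrete, here is a minimal sketch of the read loop. The helper name readLines is made up for this illustration. std::getline removes the '\n', but if the input uses "\r\n" line endings and the stream does not translate them, a '\r' is left at the end of each line, which would stop a whole-line regular expression from matching; the sketch strips it by hand.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads every line of `source`, dropping the line terminator.
std::vector<std::string> readLines(std::istream& source)
{
    std::vector<std::string> lines;
    std::string line;
    // The loop ends at end of input, and also on a read error.
    while (std::getline(source, line)) {
        // getline has already removed '\n'; remove a leftover '\r' from "\r\n" endings.
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        lines.push_back(line);
    }
    return lines;
}
```

Called on a std::istringstream (or a std::ifstream), this returns one string per line with no terminators attached, whether or not the last line ends in a newline.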

EDIT:

It just occurs to me that while regular expressions in C++ and in Python are very similar, the syntax for using them isn't quite the same. So:

In C++, you'd normally declare an instance of the regular expression before using it; something like Python's re.match( r'...', line ) is theoretically possible, but not very idiomatic (and it would still involve explicitly constructing a regular expression object in the expression). Also, the match function simply returns a boolean; if you want the captures, you need to define a separate object for them. Typical use would probably be something like:

static std::regex const matcher( "the regular expression" );
std::smatch forCaptures;
if ( std::regex_match( line, forCaptures, matcher ) ) {
    std::string firstCapture = forCaptures[1];
    //  ...
}

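This corresponds roughly to the following Python (strictly speaking, std::regex_match must match the entire line, like Python's re.fullmatch, whereas re.match only anchors the match at the start):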

m = re.match( 'the regular expression', line )
if m:
    firstCapture = m.group(1)
    #   ...
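A minimal sketch contrasting std::regex_match, which must consume the whole string, with std::regex_search, which accepts a match anywhere in it. The helper names extractPages and isExactlyPages, and the deliberately tiny pattern, are assumptions made for this illustration only.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds "(N pp.)" anywhere in `line` and stores N in `pages`.
// Returns false, leaving `pages` untouched, if there is no such group.
// (std::stoi can throw std::out_of_range for an absurdly long digit run.)
bool extractPages(std::string const& line, int& pages)
{
    static std::regex const pagesPattern(R"(\((\d+) pp\.\))");
    std::smatch capture;
    if (!std::regex_search(line, capture, pagesPattern)) {
        return false;
    }
    pages = std::stoi(capture[1].str());
    return true;
}

// The same pattern with regex_match: succeeds only if `line` is exactly "(N pp.)".
bool isExactlyPages(std::string const& line)
{
    static std::regex const pagesPattern(R"(\((\d+) pp\.\))");
    return std::regex_match(line, pagesPattern);
}
```

For a line format as rigid as the one in the question, matching the whole line is usually what you want, since trailing garbage then counts as a format error rather than being silently ignored.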

EDIT:

Another answer has suggested overloading operator>>; I heartily concur. Just out of curiosity, I gave it a go; something like the following works well:

struct Book
{
    std::string author;
    std::string title;
    int         pages;
    std::string code;
    std::string status;
};

std::istream&
operator>>( std::istream& source, Book& dest )
{
    std::string line;
    std::getline( source, line );
    if ( source )
    {
        static std::regex const matcher(
            R"^(([^,]*),\s*"([^"]*)"\s*\((\d+) pp\.\)\s*\[(\S+)\s*([^\]]*)\])^"
            );
        std::smatch capture;
        if ( ! std::regex_match( line, capture, matcher ) ) {
            source.setstate( std::ios_base::failbit );
        } else {
            dest.author = capture[1];
            dest.title  = capture[2];
            dest.pages  = std::stoi( capture[3] );
            dest.code   = capture[4];
            dest.status = capture[5];
        }
    }
    return source;
}

Once you've done this, you can write things like:

std::vector<Book> v( (std::istream_iterator<Book>( inputFile )),
                     (std::istream_iterator<Book>()) );

And load an entire file in the initialization of a vector.

Note the error handling in the operator>>. If a line is malformed, we set failbit; this is the standard convention in C++.
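On the calling side, the failbit convention means a plain read loop stops both at end of input and at the first malformed line, so the caller has to tell the two apart. The following is only a sketch of one way to do that: KeyValue, LoadResult and loadAll are hypothetical names, and the "key=value" format is a stand-in chosen to keep the example short.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type, for illustration only: one "key=value" pair per line.
struct KeyValue
{
    std::string key;
    std::string value;
};

// Same convention as above: a malformed line sets failbit on the stream.
std::istream& operator>>(std::istream& source, KeyValue& dest)
{
    std::string line;
    if (std::getline(source, line)) {
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) {
            source.setstate(std::ios_base::failbit);
        } else {
            dest.key = line.substr(0, eq);
            dest.value = line.substr(eq + 1);
        }
    }
    return source;
}

struct LoadResult
{
    std::vector<KeyValue> items;
    bool formatError;
};

// Reads records until the input is exhausted or a record fails to parse.
LoadResult loadAll(std::istream& source)
{
    LoadResult result{ {}, false };
    KeyValue item;
    // peek() tells us whether any input is left, so a failed extraction
    // inside the loop cannot be a plain end of file.
    while (source.peek() != std::char_traits<char>::eof()) {
        if (!(source >> item)) {
            result.formatError = true;
            break;
        }
        result.items.push_back(item);
    }
    return result;
}
```

Checking source.eof() after the loop would not be enough on its own, because a malformed last line without a trailing newline sets both eofbit and failbit.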

EDIT:

Since there's been so much discussion: the above is fine for small, one-off programs, such as school projects or a throwaway program which reads the current file, outputs it in a new format, and is then discarded. In production code, I would insist on support for comments and empty lines; on continuing in case of error, in order to report multiple errors (with line numbers); and probably on continuation lines (since titles can get long enough to become unwieldy). It's not practical to do this with operator>>, if for no other reason than the need to output line numbers, so I'd use a parser along the following lines:

int
getContinuationLines( std::istream& source, std::string& line )
{
    int results = 0;
    while ( source.peek() == '&' ) {
        std::string more;
        std::getline( source, more );   //  Cannot fail, because of peek
        more[0] = ' ';
        line += more;
        ++ results;
    }
    return results;
}

void
trimComment( std::string& line )
{
    char quoted = '\0';
    std::string::iterator position = line.begin();
    //  Advance until the first '#' that is not inside a quoted string.
    while ( position != line.end() && (quoted != '\0' || *position != '#') ) {
        if ( *position == '\\' && std::next( position ) != line.end() ) {
            ++ position;
        } else if ( *position == quoted ) {
            quoted = '\0';
        } else if ( quoted == '\0' && (*position == '\"' || *position == '\'') ) {
            quoted = *position;
        }
        ++ position;
    }
    line.erase( position, line.end() );
}

bool
isEmpty( std::string const& line )
{
    return std::all_of(
        line.begin(),
        line.end(),
        []( unsigned char ch ) { return isspace( ch ); } );
}

std::vector<Book>
parseFile( std::istream& source )
{
    std::vector<Book> results;
    int lineNumber = 0;
    std::string line;
    bool errorSeen = false;
    while ( std::getline( source, line ) ) {
        ++ lineNumber;
        int extraLines = getContinuationLines( source, line );
        trimComment( line );
        if ( ! isEmpty( line ) ) {
            static std::regex const matcher(
                R"^(([^,]*),\s*"([^"]*)"\s*\((\d+) pp\.\)\s*\[(\S+)\s*([^\]]*)\])^"
                );
            std::smatch capture;
            if ( ! std::regex_match( line, capture, matcher ) ) {
                std::cerr << "Format error, line " << lineNumber << std::endl;
                errorSeen = true;
            } else {
                //  Book is an aggregate, so build it with braces.
                results.push_back( Book{
                    capture[1],
                    capture[2],
                    std::stoi( capture[3] ),
                    capture[4],
                    capture[5] } );
            }
        }
        lineNumber += extraLines;
    }
    if ( errorSeen ) {
        results.clear();    //  Or more likely, throw some sort of exception.
    }
    return results;
}

The real issue here is how you report the error to the caller; I suspect that in most cases, an exception would be appropriate, but depending on the use case, other alternatives may be valid as well. In this example, I just return an empty vector. (The interaction between comments and continuation lines probably needs to be better defined as well, with modifications according to how it has been defined.)
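As a rough sketch of the exception alternative: read everything first, collect the numbers of the malformed lines, and throw once at the end so that the caller sees all of them. FormatError and readMatchingLines are hypothetical names invented for this illustration, and the function only checks lines against a pattern rather than building Book objects.

```cpp
#include <istream>
#include <regex>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical exception type: carries the numbers of all malformed lines.
class FormatError : public std::runtime_error
{
public:
    explicit FormatError(std::vector<int> lines)
        : std::runtime_error("input contains malformed lines")
        , badLines(std::move(lines))
    {
    }

    std::vector<int> badLines;
};

// Returns the lines that match `pattern` in full. If any line does not match,
// throws FormatError after the whole input has been read, so that every
// offending line number is reported, not just the first.
std::vector<std::string> readMatchingLines(std::istream& source,
                                           std::regex const& pattern)
{
    std::vector<std::string> good;
    std::vector<int> bad;
    std::string line;
    int lineNumber = 0;
    while (std::getline(source, line)) {
        ++lineNumber;
        if (std::regex_match(line, pattern)) {
            good.push_back(line);
        } else {
            bad.push_back(lineNumber);
        }
    }
    if (!bad.empty()) {
        throw FormatError(std::move(bad));
    }
    return good;
}
```

A caller would wrap the call in try/catch and decide there whether to print the line numbers, retry, or abort; the parsing code itself no longer needs to write to std::cerr.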

Upvotes: 7

Jonathan Mee

Reputation: 38919

Your input string is well delimited so I'd recommend using an extraction operator over a regex, for speed and for ease of use.

You'd first need to create a struct for your books:

struct book{
    string author;
    string title;
    int pages;
    string code;
    string status;
};

Then you'd need to write the actual extraction operator:

istream& operator>>(istream& lhs, book& rhs){
    lhs >> ws;
    getline(lhs, rhs.author, ',');
    lhs.ignore(numeric_limits<streamsize>::max(), '"');
    getline(lhs, rhs.title, '"');
    lhs.ignore(numeric_limits<streamsize>::max(), '(');
    lhs >> rhs.pages;
    lhs.ignore(numeric_limits<streamsize>::max(), '[');
    lhs >> rhs.code >> ws;
    getline(lhs, rhs.status, ']');
    return lhs;
}

This gives you a tremendous amount of power. For example you can extract all the books from an istream into a vector like this:

istringstream foo("P.G. Wodehouse, \"Heavy Weather\" (336 pp.) [PH.409 AVAILABLE FOR LENDING]\nJohn Bunyan, \"The Pilgrim's Progress\" (336 pp.) [E.1173 CHECKED OUT]");
vector<book> bar{ istream_iterator<book>(foo), istream_iterator<book>() };

Upvotes: 3

David

Reputation: 1680

Here's the code:

#include <iostream>
#include <string>

using namespace std;

string extract (string a)
{
    string str = "AUTHOR = "; //the result string
    int i = 0;
    while (a[i] != ',')
        str += a[i++];
    while (a[i++] != '\"');

    str += "\nTITLE = ";
    while (a[i] != '\"')
        str += a[i++];
    while (a[i++] != '(');

    str += "\nPAGES = ";
    while (a[i] != ' ')
        str += a[i++];
    while (a[i++] != '[');

    str += "\nCODE = ";
    while (a[i] != ' ')
        str += a[i++];
    while (a[i++] == ' ');

    str += "\nSTATUS = ";
    while (a[i] != ']')
        str += a[i++];
    return str;
}

int main ()
{
    string a;
    getline (cin, a);
    cout << extract (a) << endl;
    return 0;
}

Happy coding :)

Upvotes: 0

JJoao

Reputation: 5347

Use flex (it generates C or C++ code, to be used as a part or as the full program)

%%
^[^,]+/,          {printf("Author: %s\n",yytext  );}
\"[^"]+\"         {printf("Title: %s\n",yytext  );}
\([^ ]+/[ ]pp\.   {printf("Pages: %s\n",yytext+1);}
..................
.|\n              {}
%%

(untested)

Upvotes: 2
