zeus_masta_funk
zeus_masta_funk

Reputation: 1498

Quickest Way to parse a string of numbers into a vector of ints

I'm wondering what the quickest way to parse a string of numbers into a vector of ints. My situation is that I will have millions of lines of data, formatted like this:

>Header-name
ID1    1    1   12
ID2    3    6   234
.
.
.
>Header-name
ID1    1    1   12
ID2    3    6   234
.
.
.

I would like to discard the "Header-name" field (or maybe use it for sorting later on), and then ignore the ID field and then place the remaining three ints into a vector. I realize that I could just used boost split and then lexical cast in a couple of for loops with logic to ignore certain data, but I'm not sure if that will give me the quickest solution. I've looked at boost spirit but I don't really understand how to use it. Boost or STL are all ok.

Upvotes: 3

Views: 1737

Answers (4)

sehe
sehe

Reputation: 392954

As always, with delightfully underspecified questions like this, there's not a lot more than just showing "a way" to do "a thing". In this case, I used Boost Spirit (because you mentioned it):

Parsing into flat containers

#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>
#include <map>

std::string const input(
    ">Header - name1\n"
    "ID1    1    1   12\n"
    "ID2    3    6   234\n"
    ">Header - name2\n"
    "ID3    3    3   14\n"
    "ID4    5    8   345\n"
);

using Header    = std::string;
using Container = std::vector<int>;
using Data      = std::map<Header, Container>;

int main()
{
    namespace qi = boost::spirit::qi;

    auto f(input.begin()), l(input.end());

    Data data;
    bool ok = qi::phrase_parse(f, l,
        *(
            '>' >> qi::raw[*(qi::char_ - qi::eol)] >> qi::eol
           >> *(!qi::char_('>') >> qi::omit[qi::lexeme[+qi::graph]] >> *qi::int_ >> qi::eol)
        ), qi::blank, data);

    if (ok)
    {
        std::cout << "Parse success\n";
        for (auto const& entry : data)
        {
            std::cout << "Integers read with header '" << entry.first << "':\n";
            for (auto i : entry.second)
                std::cout << i << " ";
            std::cout << "\n";
        }
    }
    else
    {
        std::cout << "Parse failed\n";
    }

    if (f != l)
        std::cout << "Remaining input: '" << std::string(f, l) << "'\n";
}

Prints

Parse success
Integers read with header 'Header - name1':
1 1 12 3 6 234
Integers read with header 'Header - name2':
3 3 14 5 8 345

Parsing into nested containers

Of course, if you wanted separate vectors for each line (don't expect efficiency) then you can simply replace the typedef:

using Container = std::list<std::vector<int> >; // or any other nested container

// to make printing work without further change:
std::ostream& operator<<(std::ostream& os, std::vector<int> const& v)
{
    os << "[";
    std::copy(v.begin(), v.end(), std::ostream_iterator<int>(os, " "));
    return os << "]";
}

Prints

Parse success
Integers read with header 'Header - name1':
[1 1 12 ] [3 6 234 ]
Integers read with header 'Header - name2':
[3 3 14 ] [5 8 345 ]

Upvotes: 1

Come Raczy
Come Raczy

Reputation: 1680

On this specific problem, if you want the quickest, I would recommend manual parsing 1 char at a time. Boost Spirit would probably come as a close second and save you lots of ugly code.

Manual parsing one char at a time is key to high speed, as even well optimized converters like atoi and strtol have to deal with many different numeric representations while your example seems to imply that you are only interested in plain unsigned integers. Formatted IOs (scanf, operator<<, etc.) are very slow. Reading lines into intermediate strings will probably have a visible cost.

Your problem is simple enough to parse manually, assuming that the header lines do not contain any '\t' (and assuming that there aren't any IO or format errors):

#include <iostream>
#include <sstream>
#include <vector>
#include <string>

std::vector<unsigned> parse(std::istream &is)
{
    bool skipField = true;
    char c;
    unsigned value = 0;
    std::vector<unsigned> result;
    while (is.get(c))
    {
        if (('\t' == c) || ('\n' == c))
        {
            if (!skipField)
            {
                result.push_back(value);
            }
            skipField = ('\n' == c);
            value = 0;
        }
        else if (!skipField)
        {
            value *= 10;
            value += (c - '0');
        }
    }
    return result;
}

int main()
{
    const std::string data = ">Header-name\nID1\t1\t1\t12\nID2\t3\t6\t234\n";
    std::istringstream is(data);
    const std::vector<unsigned> v = parse(is);
    for (unsigned u: v)
    {
        std::cerr << u << std::endl;
    }
}

Upvotes: 1

Vlad from Moscow
Vlad from Moscow

Reputation: 310980

You can use something like the following only instead of the string array I used you will get strings from a file

#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>

int main() 
{
    std::string s[] = { "ID1    1    1   12", "ID2    3    6   234" };
    std::vector<int> v;

    for ( const std::string &t : s )
    {
        std::istringstream is( t );
        std::string tmp;

        is >> tmp;

        v.insert( v.end(), std::istream_iterator<int>( is ), 
                           std::istream_iterator<int>() );
    }                         

    for ( int x : v ) std::cout << x << ' ';
    std::cout << std::endl;

    return 0;
}

The output is

1 1 12 3 6 234 

As for the header then you can check whether tmp is a header and if so you will skip this record.

Here is a simplified version

#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>

int main() 
{
    std::string s[] = 
    { 
        "ID1    1    1   12", 
        ">Header-name", 
        "ID2    3    6   234" 
    };

    std::vector<int> v;

    for ( const std::string &t : s )
    {
        std::istringstream is( t );
        std::string tmp;

        is >> tmp;

        if ( tmp[0] == '>' ) continue;

        v.insert( v.end(), std::istream_iterator<int>( is ), 
                           std::istream_iterator<int>() );
    }                         

    for ( int x : v ) std::cout << x << ' ';
    std::cout << std::endl;

    return 0;
}

The output will be the same as above.

Upvotes: 0

kiss-o-matic
kiss-o-matic

Reputation: 1181

Do you have to use boost? I've used this function for a while. I believe I got it out of Accelerated C++ and have used it since. Your delimiter seems to be a tab, or multiple white spaces. If you pass the delimiter a " " it might work. I think it will depend on what's actually there though.

std::vector<std::string> split( const std::string& line, const std::string& del )
{
        std::vector<std::string> ret;
        size_t i = 0;

        while ( i != line.size() ) {

                while ( ( i != line.size() ) && ( line.substr(i, 1) == del ) ) {
                        ++i;
                }

                size_t j = i;

                while ( ( j != line.size() ) && ( line.substr(j, 1) != del ) ) {
                        ++j;
                }

                if ( i != j ) {
                        ret.push_back( line.substr( i, j - i ) );
                        i = j;
                }
        }

        return ret;
}

You can get each line with this:

int main() {
    std::string line;
    std::vector<std::string> lines; 
    while ( std::getline( std::cin, line ) ) {
        lines.push_back( line );
    }

    for ( auto it = lines.begin(); it != lines.end(); it++ ) {
        std::vector<string> vec = split( (*it) );
        // Do something
    }
}

You can get it to return std::vector with a quick modification. Make each string an int with atoi( myString.c_str() ) Also you'll want to put a check in to skip the headers. Should be trivial.

Note that I've not compiled that above. ;)

Upvotes: 1

Related Questions