Reputation: 2088

Read long text word by word without trash

I am trying to read a long text and split it into the individual words it contains. My first attempt was to read it from a file using std::ifstream and operator>> to read into a string. The problem is that, since this only splits the text on whitespace characters, I still get periods attached to the last word of a sentence (like problem.) and some special strings that don't mean anything (sometimes I get -> or **).

I thought about reading char by char, or splitting each string I read char by char, and removing the characters that aren't in the correct range (a-z, A-Z and 0-9), but this solution seems very messy. Also, I can't use regular expressions, since I'm using GCC 4.8.3 and it is not possible to use Boost.

Is there a better solution than this second one, or is it a good approach? By good I mean relatively easy to implement and yielding the expected result (only alphanumeric characters).

Upvotes: 1

Answers (2)

user2249683


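You could install a custom ctype facet in your stream's locale. The facet below additionally classifies control characters, digits and punctuation as whitespace, so operator>> splits on them: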

#include <cstddef>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

class WordCharacterClassification : public std::ctype<char>
{
    private:
    typedef std::ctype<char> Base;
    const mask* initialize_table(const Base&);

    public:
    typedef Base::mask mask;
    typedef Base::char_type char_type;

    public:
    WordCharacterClassification(const Base& source, std::size_t refs = 0)
    :   Base(initialize_table(source), false, refs)
    {}


    private:
    mask m_table[Base::table_size];
};

inline const typename WordCharacterClassification::mask*
WordCharacterClassification::initialize_table(const Base& source) {
    const mask* src = source.table();
    const mask* src_end = src + Base::table_size;
    const mask space
        = std::ctype_base::space
        | std::ctype_base::cntrl
        | std::ctype_base::digit
        | std::ctype_base::punct;

    mask* dst = m_table;
    for( ; src < src_end; ++dst, ++src) {
        *dst = *src;
        if(*src & space)
            *dst |= std::ctype_base::space;
    }
    return m_table;
}


int main() {
    std::istringstream in("This->is a delimiter-test4words");
    std::locale locale = in.getloc();

    WordCharacterClassification classification(
        std::use_facet<std::ctype<char>>(locale),
        // refs = 1: we keep ownership, so the locale must not delete
        // this stack-allocated facet.
        1);

    in.imbue(std::locale(locale, &classification));

    std::string word;
    std::cout << "Words:\n";
    while(in >> word) {
        std::cout << word << '\n';
    }
}

Note: A static table, built directly instead of copying the original one, would simplify this.

Upvotes: 1

Greg Hilston

Reputation: 2424

Your second solution would be a fine implementation and would probably help you learn how to handle input. You could handle each character based on isalpha (http://www.cplusplus.com/reference/cctype/isalpha/): any character for which it returns false immediately ends the current word, and the next word starts at the next character for which it returns true.

Upvotes: 0
