Reputation: 180

Multi stream iterators c++

The purpose of my program is to open a text file of m lines of the same length n, read the file column by column and print each column.

For example, for this text file

abcd
efgh 
jklm

I would like to print

a e j
b f k
c g l
d h m

As one line length can be 200 000 000 and the column length can be more than 10 000, I can't open all the file in memory in a matrix.

Theoretically, I would like to have a program that use O(m) in space and O(m*n) in time.

At the beginning, I had to think about these solutions:

if I see all the file for each column the complexity is O(m*n²),
If I use seekg and an array of positions and jump from position to position, the complexity is O(mnlog(n)).

Last point, for some server problems, I need to use only the STL.

My last idea is to create an array of iterators of a file and initialized these iterators at the beginning of each line. After that, to see the next column, I only need to increase each iterator. This is my code

ifstream str2;
str2.open ("Input/test.data", ifstream::in);

int nbline = 3;
int nbcolumn = 4;
int x = 0;

istreambuf_iterator<char> istart (str2);
istreambuf_iterator<char> iend ;

istreambuf_iterator<char>* iarray;
iarray = new istreambuf_iterator<char>[nbline];


while (istart != iend){
    if (x % nbcolumn == 0){
        iarray[x/nbcolumn] = istart;
    }
    istart++;
    x++;
}

for (int j = 0; j<nbcolumn;j++){
    for (int i = 0; i<nbline;i++){
        cout  << *iarray[i] << "\t";
        iarray[i]++;
    }
    cout << endl;
}

Sadly, it does not work, and I have this thing as output

a       e       f       
�       �       �       
�       �       �       
�       �       �

I think the problem is that the array of iterators iarray are not independent of istart, how I can do that?

Upvotes: 14

Answers (5)

Jörg Brüggmann

Reputation: 706

General Considerations

If the array of iterators would work, it would have to iterate thru memory (see also answer William Miller), or where should it iterate thru?

The trade-off is:

Parsing until first output line is complete, than the same for all the other output lines
- slow, almost no memory used
Fill a matrix completely and output the transposed matrix
- lot of memory to be used
Create an array of positions for all output lines, seek thru all positions
- fast, and reasonable memory usage
An very inteligent mix of method 2 and 3.
- taking the shortes possible time with a given memory (e.g. lets say 8 GByte RAM).

Solution for trade-off 4

Neededs more knowledge regarding the boundary condition.

A concept for solution 4 depends on a lot of unknown conditions

What are the characteristics of input data?
- Is the 200TByte for one matrix of for several matrixes?
- For how many?
- What is the worst case of ratio between columns and rows?
- Is it just single characters or can is be words?
- If it is just single characters, is it assured that each line the same memory size?
- If not, how to recognise a new line?
How much free RAM memory is available?
How fast is the target computer to fill the whole free RAM memory?
What is the maximum of time that is acceptable?

Problem Analysis to the original programm

The question is also: Why it doesn't work.

The program ...

#include    <fstream>
#include    <string>
#include    <iostream>

int main(int argc, char* argv[]) {
    std::ifstream str2;
    str2.open ("test.data", std::ifstream::in);

    std::istreambuf_iterator<char> istart(str2);
    std::istreambuf_iterator<char> iend;
    std::istreambuf_iterator<char> iarray1 = istart;

    istart++;
    istart++;
    istart++;
    istart++;
    std::istreambuf_iterator<char> iarray2 = istart;

    std::cout  << *(iarray1);
    std::cout << std::endl;
    std::cout  << *(iarray2);
    std::cout << std::endl;
    return 0;
}

...reads test.data contains...

abcdefghjklm

...and the program prints...

e
e

Hence, the loop...

while (istart != iend){
    if (x % nbcolumn == 0){
        iarray[x/nbcolumn] = istart;
    }
    istart++;
    x++;
}

...will not lead to the expected result, because the iterator is working in a different way, and each call of...

iarray[i]++;

...is manipulating all iterator at the same time.

Solution for trade-off 3

What is the way out? Creating code according to trade-off #3.

The program...

#include    <iostream>
#include    <ios>
#include    <string>
#include    <fstream>

int main(int argc, char* argv[]) {
    int nbline = 3;
    int nbcolumn = 4;
    std::ifstream   fsIn;
    std::streampos  posLine[nbline];
    std::streampos  posTemp;

    fsIn.open("test.data", std::ifstream::in);
    for ( int i = 0; i < nbline; i++) {
        posLine[i] = posTemp;
        posTemp += nbcolumn;
    }

    for ( int j = 0; j < nbcolumn; j++) {
        for ( int i = 0; i < nbline; i++) {
            fsIn.seekg(posLine[i]);
            std::cout  << char(fsIn.get()) << " ";
            posLine[i] = fsIn.tellg();
        }
        std::cout << std::endl;
    }
    return 0;
}

...creates the output:

a e j
b f k
c g l
d h m

Upvotes: 0

Abdurrahim

Reputation: 2144

You cannot use istreambuf_iterator twice it can only be used once. Anyhow hope code below helps you

Let me explain what I am trying to do first; You know file reads are much faster when you do it sequentally. What I am doing there is buffered read. Lets say in your example I am buffering two lines so I have to allocate 6 bytes of buffer and fill it with seeks; Each read will read two bytes as we are holding two lines. This can be optimized though if you print out first character as you read immediately you can buffer two lines just by using 3 bytes and threelines just by buffering 6 bytes in your example. Anyhow I am giving you non optimized version of it.

Again let me remind you, you cannot use istreambuf_iterator twice: How do I use an iterator on an ifstream twice in C++?

if you have to use iterator you can implement your iterator that can seek and read on a file; can be really messy though,,,

#include <iostream>
#include <fstream>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <algorithm>

std::vector<std::size_t> getPositions(std::ifstream& str2, int &numcolumns) {
    std::vector<std::size_t> iarray;

    iarray.push_back(0); // Add first iterator

    bool newlinereached = false;
    int tmpcol = 0;
    int currentLine = 0;
    char currentChar = 0;
    char previosChar = 0;

    numcolumns = -1;

    for (str2.seekg(0, std::ios_base::beg); !str2.eof(); previosChar = currentChar) {
        const std::size_t currentPosition = str2.tellg();
        str2.read(&currentChar, 1);
        if (newlinereached) {
            if (currentChar == '\r') {
                // Always error but skip for now :)
                continue;
            }
            else if (currentChar == '\n') {
                // ERROR CONDITION WHEN if (numcolumns < 0) or previosChar == '\n'
                continue;
            }
            else if (tmpcol == 0) {
                throw std::runtime_error((std::stringstream() << "Line " << currentLine << " is empty").str());
            }
            else {
                if (numcolumns < 0) {
                    // We just found first column size
                    numcolumns = tmpcol;
                    iarray.reserve(numcolumns);
                }
                else if (tmpcol != numcolumns) {
                    throw std::runtime_error((std::stringstream() << "Line " << currentLine
                        << " have incosistend number of columns it should have been " << numcolumns).str());
                }

                iarray.push_back(currentPosition);
                tmpcol = 1;
                newlinereached = false;
            }
        }
        else if (currentChar == '\r' || currentChar == '\n') {
            newlinereached = true;
            ++currentLine;
        }
        else {
            tmpcol++;
        }
    }

    if (currentChar == 0) {
        throw std::runtime_error((std::stringstream() << "Line " << currentLine
            << " contains 'null' character " << numcolumns).str());
    }

    str2.clear(); // Restart 

    return iarray;
}

int main() {
    using namespace std;

    ifstream str2;
    str2.open("Text.txt", ifstream::in);
    if (!str2.is_open()) {
        cerr << "Failed to open the file" << endl;
        return 1;
    }

    int numinputcolumns = -1;

    std::vector<std::size_t> iarray =
        getPositions(str2, numinputcolumns); // S(N)

    const std::size_t numinputrows = iarray.size();

    std::vector<char> buffer;
    const int numlinestobuffer = std::min(2, numinputcolumns); // 1 For no buffer

    buffer.resize(numinputrows * numlinestobuffer); // S(N)

    const std::size_t bufferReadMax = buffer.size();


    for (int j = 0; j < numinputcolumns; j += numlinestobuffer)
    {
        // Seek fill buffer. Needed because sequental reads are much faster even on SSD
        // Still can be optimized more: We can buffer n+1 rows as we can discard current row read
        std::size_t nread = std::min(numlinestobuffer, numinputcolumns - j);
        for (int i = 0; i < numinputrows; ++i)
        {
            str2.seekg(iarray[i], ios_base::beg);
            size_t p = str2.tellg();
            str2.read(&buffer[i * numlinestobuffer], nread);
            iarray[i] += nread;
        }

        // Print the buffer
        for (int b = 0; b < nread; ++b)
        {
            for (int k = 0; k < numinputrows; ++k) {
                std::cout << buffer[b + k * numlinestobuffer] << '\t';
            }
            std::cout << std::endl;
        }
    }

    return 0;
}

Upvotes: 0

Marek R

Reputation: 37697

I would do this like this:

Open the source file.
Measure line size
Measure line count (file size / (line size + size of EOL)). Note EOL can be 2 bytes.
calculate result file size. Open result file and force it to have it desired size, so later you can seek to any part of the file.
peak some size of square which is memory manageable. For example 1024x1024
Now you should load square part of the matrix. 1024 elements for rows of 1024 constitutive rows.
Transpose square
Write it to destination file by seeking to proper column of each part of row you are writing. (you can reduce memory consumption in previous point by transposing one column and then write it as a row, instead transposing whole square at once)
Iterate square over whole file matrix

IMO you can't do it better. Most critical will be how to select size of the square. Big power of 2 is recommended.

Upvotes: 5

William Miller

Reputation: 10320

If you want to do this using multiple std::istreambuf_iterators then you will need multiple fstreams for them to act on, otherwise when you iterate one (i.e. istart++) that will affect all the iterators for that fstream, meaning that the next time you iterate one (i.e. *iarray[i]++) you will skip a character. This is explained more clearly in the reference. Consider this snippet:

std::ifstream str;
str.open("test.data", std::ifstream::in);

std::istreambuf_iterator<char> i1 (str);
std::istreambuf_iterator<char> i2 (str);

std::cout << "i1 - " << *i1 << "   i2 - " << *i2 << std::endl;
i1++;
std::cout << "i1 - " << *i1 << "   i2 - " << *i2 << std::endl;
i2++;
std::cout << "i1 - " << *i1 << "   i2 - " << *i2 << std::endl;

which will output

i1 - a   i2 - a
i1 - b   i2 - a
i1 - b   i2 - c

Where i2 has appeared to 'skip' b in the stream. Even if you assign the second iterator later, i.e.

std::ifstream str;
str.open("test.data", std::ifstream::in);

std::istreambuf_iterator<char> i1 (str);
std::istreambuf_iterator<char> i2;
std::istreambuf_iterator<char> iend;

int x = 0;
while (i1 != iend) {
    if (x % 4 == 0) {
        i2 = i1;
        break;
    }
    x++;
    i1++;
}

std::cout << *i1 << " " << *i2 << std::endl;
i1++;
std::cout << *i1 << " " << *i2 << std::endl;
i2++;
std::cout << *i1 << " " << *i2 << std::endl;

the output remains the same -

i1 - a   i2 - a
i1 - b   i2 - a
i1 - b   i2 - c

Why?

Because in either case both iterators are acting on the same stream object, and every time you iterate one it removes a character from the stream. In the code in question every iterator (istart, iarray[i]) acts on the same stream object and therefore every iteration of one of them removes a char from the stream. The output is then quickly the result of undefined behavior, as iterating beyond the end-of-stream is undefined (and since the iterators are iterating together you reach it quickly).

If you want to do this the way you have outline, you simply need multiple fstream objects, such as

#include <fstream>
#include <string>
#include <iostream>


int main(int argn, char** argv) {
    std::ifstream str2;
    str2.open ("test.data", std::ifstream::in);

    int nbline = 3;
    int nbcolumn = 4;
    int x = 0;

    std::istreambuf_iterator<char> istart (str2);
    std::istreambuf_iterator<char> iend ;

    std::ifstream* streams = new std::ifstream[nbline];
    for (int ii = 0; ii < nbline; ii++) {
        streams[ii].open("test.data", std::ifstream::in);
    }
    std::istreambuf_iterator<char>* iarray = new std::istreambuf_iterator<char>[nbline];
    for (int ii = 0; ii < nbline; ii ++) {
        iarray[ii] = std::istreambuf_iterator<char> (streams[ii]);
    }

    int idx = 0;
    while (istart != iend) {
        if (x % nbcolumn == 0) {
            std::advance(iarray[x/nbcolumn], (nbcolumn+1)*idx);
            idx++;
        }
        x++;
        istart++;
    }

    for (int ii = 0; ii < nbcolumn; ii ++) {
        for (int jj = 0; jj < nbline; jj ++) {
            std::cout << *iarray[jj]++ << "\t";
        }
        std::cout << std::endl;
    }
}

Which produces the output you are expecting,

a       e       j
b       f       k
c       g       l
d       h       m

I can make no comment on the speed of this method relative to others that have been suggested, but this is how you would do what you are asking using this method.

Upvotes: 2

1201ProgramAlarm

Reputation: 32732

You can break the task into chunks, then process each chunk before moving on to the next.

You'd need a buffer for each line (the larger this is the better the performance will be) and the seek position for that row. You may also need to make an initial pass thru the file to get the correct offsets for each row.

Read B bytes into the buffer for each row (using tellg to save the position in each row), then loop over those and generate your output. Go back and read the next B bytes from each row (using seekg to set the file position beforehand, and tellg to remember it afterwards) and generate the output. Repeat until you're done, being careful with the last chunk (or with small inputs) to not go past the end of the line.

Using your example, you have 3 rows to keep track of. Using a B size of 2, you'd read in ab, ef, and jk into your 3 buffers. Looping over those you'd output aej and bfk. Go back and read the next chunks: cd, gh, and lm. This gives cgl and dhm as output.

Upvotes: 6

Multi stream iterators c++

Answers (5)

Why?

Related Questions