Reputation: 180
The purpose of my program is to open a text file of m lines of the same length n, read the file column by column and print each column.
For example, for this text file
abcd
efgh
jklm
I would like to print
a e j
b f k
c g l
d h m
As one line length can be 200 000 000 and the column length can be more than 10 000, I can't open all the file in memory in a matrix.
Theoretically, I would like to have a program that use O(m) in space and O(m*n) in time.
At the beginning, I had to think about these solutions:
Last point, for some server problems, I need to use only the STL.
My last idea is to create an array of iterators of a file and initialized these iterators at the beginning of each line. After that, to see the next column, I only need to increase each iterator. This is my code
ifstream str2;
str2.open ("Input/test.data", ifstream::in);
int nbline = 3;
int nbcolumn = 4;
int x = 0;
istreambuf_iterator<char> istart (str2);
istreambuf_iterator<char> iend ;
istreambuf_iterator<char>* iarray;
iarray = new istreambuf_iterator<char>[nbline];
while (istart != iend){
if (x % nbcolumn == 0){
iarray[x/nbcolumn] = istart;
}
istart++;
x++;
}
for (int j = 0; j<nbcolumn;j++){
for (int i = 0; i<nbline;i++){
cout << *iarray[i] << "\t";
iarray[i]++;
}
cout << endl;
}
Sadly, it does not work, and I have this thing as output
a e f
� � �
� � �
� � �
I think the problem is that the array of iterators iarray are not independent of istart, how I can do that?
Upvotes: 14
Views: 808
Reputation: 706
General Considerations
If the array of iterators would work, it would have to iterate thru memory (see also answer William Miller), or where should it iterate thru?
The trade-off is:
Solution for trade-off 4
Neededs more knowledge regarding the boundary condition.
A concept for solution 4 depends on a lot of unknown conditions
Problem Analysis to the original programm
The question is also: Why it doesn't work.
The program ...
#include <fstream>
#include <string>
#include <iostream>
int main(int argc, char* argv[]) {
std::ifstream str2;
str2.open ("test.data", std::ifstream::in);
std::istreambuf_iterator<char> istart(str2);
std::istreambuf_iterator<char> iend;
std::istreambuf_iterator<char> iarray1 = istart;
istart++;
istart++;
istart++;
istart++;
std::istreambuf_iterator<char> iarray2 = istart;
std::cout << *(iarray1);
std::cout << std::endl;
std::cout << *(iarray2);
std::cout << std::endl;
return 0;
}
...reads test.data contains...
abcdefghjklm
...and the program prints...
e
e
Hence, the loop...
while (istart != iend){
if (x % nbcolumn == 0){
iarray[x/nbcolumn] = istart;
}
istart++;
x++;
}
...will not lead to the expected result, because the iterator is working in a different way, and each call of...
iarray[i]++;
...is manipulating all iterator at the same time.
Solution for trade-off 3
What is the way out? Creating code according to trade-off #3.
The program...
#include <iostream>
#include <ios>
#include <string>
#include <fstream>
int main(int argc, char* argv[]) {
int nbline = 3;
int nbcolumn = 4;
std::ifstream fsIn;
std::streampos posLine[nbline];
std::streampos posTemp;
fsIn.open("test.data", std::ifstream::in);
for ( int i = 0; i < nbline; i++) {
posLine[i] = posTemp;
posTemp += nbcolumn;
}
for ( int j = 0; j < nbcolumn; j++) {
for ( int i = 0; i < nbline; i++) {
fsIn.seekg(posLine[i]);
std::cout << char(fsIn.get()) << " ";
posLine[i] = fsIn.tellg();
}
std::cout << std::endl;
}
return 0;
}
...creates the output:
a e j
b f k
c g l
d h m
Upvotes: 0
Reputation: 2144
You cannot use istreambuf_iterator twice it can only be used once. Anyhow hope code below helps you
Let me explain what I am trying to do first; You know file reads are much faster when you do it sequentally. What I am doing there is buffered read. Lets say in your example I am buffering two lines so I have to allocate 6 bytes of buffer and fill it with seeks; Each read will read two bytes as we are holding two lines. This can be optimized though if you print out first character as you read immediately you can buffer two lines just by using 3 bytes and threelines just by buffering 6 bytes in your example. Anyhow I am giving you non optimized version of it.
Again let me remind you, you cannot use istreambuf_iterator twice: How do I use an iterator on an ifstream twice in C++?
if you have to use iterator you can implement your iterator that can seek and read on a file; can be really messy though,,,
#include <iostream>
#include <fstream>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <algorithm>
std::vector<std::size_t> getPositions(std::ifstream& str2, int &numcolumns) {
std::vector<std::size_t> iarray;
iarray.push_back(0); // Add first iterator
bool newlinereached = false;
int tmpcol = 0;
int currentLine = 0;
char currentChar = 0;
char previosChar = 0;
numcolumns = -1;
for (str2.seekg(0, std::ios_base::beg); !str2.eof(); previosChar = currentChar) {
const std::size_t currentPosition = str2.tellg();
str2.read(¤tChar, 1);
if (newlinereached) {
if (currentChar == '\r') {
// Always error but skip for now :)
continue;
}
else if (currentChar == '\n') {
// ERROR CONDITION WHEN if (numcolumns < 0) or previosChar == '\n'
continue;
}
else if (tmpcol == 0) {
throw std::runtime_error((std::stringstream() << "Line " << currentLine << " is empty").str());
}
else {
if (numcolumns < 0) {
// We just found first column size
numcolumns = tmpcol;
iarray.reserve(numcolumns);
}
else if (tmpcol != numcolumns) {
throw std::runtime_error((std::stringstream() << "Line " << currentLine
<< " have incosistend number of columns it should have been " << numcolumns).str());
}
iarray.push_back(currentPosition);
tmpcol = 1;
newlinereached = false;
}
}
else if (currentChar == '\r' || currentChar == '\n') {
newlinereached = true;
++currentLine;
}
else {
tmpcol++;
}
}
if (currentChar == 0) {
throw std::runtime_error((std::stringstream() << "Line " << currentLine
<< " contains 'null' character " << numcolumns).str());
}
str2.clear(); // Restart
return iarray;
}
int main() {
using namespace std;
ifstream str2;
str2.open("Text.txt", ifstream::in);
if (!str2.is_open()) {
cerr << "Failed to open the file" << endl;
return 1;
}
int numinputcolumns = -1;
std::vector<std::size_t> iarray =
getPositions(str2, numinputcolumns); // S(N)
const std::size_t numinputrows = iarray.size();
std::vector<char> buffer;
const int numlinestobuffer = std::min(2, numinputcolumns); // 1 For no buffer
buffer.resize(numinputrows * numlinestobuffer); // S(N)
const std::size_t bufferReadMax = buffer.size();
for (int j = 0; j < numinputcolumns; j += numlinestobuffer)
{
// Seek fill buffer. Needed because sequental reads are much faster even on SSD
// Still can be optimized more: We can buffer n+1 rows as we can discard current row read
std::size_t nread = std::min(numlinestobuffer, numinputcolumns - j);
for (int i = 0; i < numinputrows; ++i)
{
str2.seekg(iarray[i], ios_base::beg);
size_t p = str2.tellg();
str2.read(&buffer[i * numlinestobuffer], nread);
iarray[i] += nread;
}
// Print the buffer
for (int b = 0; b < nread; ++b)
{
for (int k = 0; k < numinputrows; ++k) {
std::cout << buffer[b + k * numlinestobuffer] << '\t';
}
std::cout << std::endl;
}
}
return 0;
}
Upvotes: 0
Reputation: 37697
I would do this like this:
IMO you can't do it better. Most critical will be how to select size of the square. Big power of 2 is recommended.
Upvotes: 5
Reputation: 10320
If you want to do this using multiple std::istreambuf_iterator
s then you will need multiple fstreams
for them to act on, otherwise when you iterate one (i.e. istart++
) that will affect all the iterators for that fstream
, meaning that the next time you iterate one (i.e. *iarray[i]++
) you will skip a character. This is explained more clearly in the reference. Consider this snippet:
std::ifstream str;
str.open("test.data", std::ifstream::in);
std::istreambuf_iterator<char> i1 (str);
std::istreambuf_iterator<char> i2 (str);
std::cout << "i1 - " << *i1 << " i2 - " << *i2 << std::endl;
i1++;
std::cout << "i1 - " << *i1 << " i2 - " << *i2 << std::endl;
i2++;
std::cout << "i1 - " << *i1 << " i2 - " << *i2 << std::endl;
which will output
i1 - a i2 - a
i1 - b i2 - a
i1 - b i2 - c
Where i2
has appeared to 'skip' b
in the stream. Even if you assign the second iterator later, i.e.
std::ifstream str;
str.open("test.data", std::ifstream::in);
std::istreambuf_iterator<char> i1 (str);
std::istreambuf_iterator<char> i2;
std::istreambuf_iterator<char> iend;
int x = 0;
while (i1 != iend) {
if (x % 4 == 0) {
i2 = i1;
break;
}
x++;
i1++;
}
std::cout << *i1 << " " << *i2 << std::endl;
i1++;
std::cout << *i1 << " " << *i2 << std::endl;
i2++;
std::cout << *i1 << " " << *i2 << std::endl;
the output remains the same -
i1 - a i2 - a
i1 - b i2 - a
i1 - b i2 - c
Because in either case both iterators are acting on the same stream object, and every time you iterate one it removes a character from the stream. In the code in question every iterator (istart
, iarray[i]
) acts on the same stream object and therefore every iteration of one of them removes a char
from the stream. The output is then quickly the result of undefined behavior, as iterating beyond the end-of-stream is undefined (and since the iterators are iterating together you reach it quickly).
If you want to do this the way you have outline, you simply need multiple fstream
objects, such as
#include <fstream>
#include <string>
#include <iostream>
int main(int argn, char** argv) {
std::ifstream str2;
str2.open ("test.data", std::ifstream::in);
int nbline = 3;
int nbcolumn = 4;
int x = 0;
std::istreambuf_iterator<char> istart (str2);
std::istreambuf_iterator<char> iend ;
std::ifstream* streams = new std::ifstream[nbline];
for (int ii = 0; ii < nbline; ii++) {
streams[ii].open("test.data", std::ifstream::in);
}
std::istreambuf_iterator<char>* iarray = new std::istreambuf_iterator<char>[nbline];
for (int ii = 0; ii < nbline; ii ++) {
iarray[ii] = std::istreambuf_iterator<char> (streams[ii]);
}
int idx = 0;
while (istart != iend) {
if (x % nbcolumn == 0) {
std::advance(iarray[x/nbcolumn], (nbcolumn+1)*idx);
idx++;
}
x++;
istart++;
}
for (int ii = 0; ii < nbcolumn; ii ++) {
for (int jj = 0; jj < nbline; jj ++) {
std::cout << *iarray[jj]++ << "\t";
}
std::cout << std::endl;
}
}
Which produces the output you are expecting,
a e j
b f k
c g l
d h m
I can make no comment on the speed of this method relative to others that have been suggested, but this is how you would do what you are asking using this method.
Upvotes: 2
Reputation: 32732
You can break the task into chunks, then process each chunk before moving on to the next.
You'd need a buffer for each line (the larger this is the better the performance will be) and the seek position for that row. You may also need to make an initial pass thru the file to get the correct offsets for each row.
Read B bytes into the buffer for each row (using tellg
to save the position in each row), then loop over those and generate your output. Go back and read the next B bytes from each row (using seekg
to set the file position beforehand, and tellg
to remember it afterwards) and generate the output. Repeat until you're done, being careful with the last chunk (or with small inputs) to not go past the end of the line.
Using your example, you have 3 rows to keep track of. Using a B size of 2, you'd read in ab
, ef
, and jk
into your 3 buffers. Looping over those you'd output aej
and bfk
. Go back and read the next chunks: cd
, gh
, and lm
. This gives cgl
and dhm
as output.
Upvotes: 6