Curtis Jackson
Curtis Jackson

Reputation: 51

Reading Text File and Using 2D Vector

I am using the method below to read a Large space delimited txt files(About 900 Mb). It took me 879s to load the data into memory. I am wondering if there is a more efficient way to read the txt file?

Another associated question is: is it a good idea to store such a huge data set using a 2D vector?

Here is my code

void Grid::loadGrid(const char* filePathGrid)
{
        // 2D vector to contain the matrix
        vector<vector<float>> data;

        unsigned nrows, ncols;
        double xllcorner, yllcorner;
        int cellsize, nodataValue;
        const int nRowHeader = 6;
    string line, strtmp;

    ifstream DEMFile;
    DEMFile.open(filePathGrid);
    if (DEMFile.is_open())
    {           
        // read the header (6 lines)
        for (int index = 0; index < nRowHeader; index++)
        {
            getline(DEMFile, line); 
            istringstream  ss(line);
            switch (index)
            {
                case 0: 
                    while (ss >> strtmp)
                    {                       
                        istringstream(strtmp) >> ncols;                     
                    }
                    break;
                case 1:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> nrows;                     
                    }
                    break;
                case 2: 
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> xllcorner;                     
                    }
                    break;                  
                case 3:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> yllcorner;                     
                    }
                    break;                      
                case 4:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> cellsize;                      
                    }
                    break;                      
                case 5:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> nodataValue;                       
                    }
                    break;                  
            }           
        }

        // Read in the elevation values
        if (ncols * nrows > 0)
        {       
            // Set up sizes. (rows x cols)
            data.resize(nrows);
            for (unsigned row = 0; row < nrows; ++row)
            {
                data[row].resize(ncols);
            }

            // Load values in   
            unsigned row = 0;
            while (row < nrows)
            {                           
                getline(DEMFile, line);
                istringstream ss(line);
                for (unsigned col =0; col < ncols; col++)
                {
                    ss >> data[row][col];
                }
                row ++;
            }
            DEMFile.close();            
        }       
    }
    else cout << "Unable to open file"; 
}

Upvotes: 3

Views: 455

Answers (3)

Curtis Jackson
Curtis Jackson

Reputation: 51

Great Answers. Using binary format instead of plain text will give me a great performance boost I think.

I had someone send me this code as well. I'll play around some more and come up with a V2.

#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>

#include <deque>
#include <vector>

using namespace std;

/* ////////////////////////////////////////////////////////////////////////////
  timer stuff
//////////////////////////////////////////////////////////////////////////// */

//----------------------------------------------------------------------------
typedef double epoch_time_t;

//----------------------------------------------------------------------------
#ifdef __WIN32__

  #include <windows.h>

  epoch_time_t get_epoch_time()  // Since 1601 Jan 1
    {
    FILETIME f;
    GetSystemTimeAsFileTime( &f );
    LARGE_INTEGER t =
      { {
      f.dwLowDateTime,
      f.dwHighDateTime
      } };
    return t.QuadPart / 10000000.0;  // 100 nano-second intervals
    }

  void waitforkeypress()
    {
    WaitForSingleObject(
      GetStdHandle( STD_INPUT_HANDLE ),
      INFINITE
      );
    }

//----------------------------------------------------------------------------
#else // POSIX

  #include <sys/time.h>

  epoch_time_t get_epoch_time()  // Since 1970 Jan 1
    {
    struct timeval tv;
    if (gettimeofday( &tv, NULL )) return 0.0;
    return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

  #include <unistd.h>
  #include <termios.h>
  #include <poll.h>

  #define INFINITE (-1)

  void waitforkeypress()
    {
    struct termios initial_settings;
    tcgetattr( STDIN_FILENO, &initial_settings );

    struct termios settings = initial_settings;
    settings.c_lflag &= ~(ICANON);
    tcsetattr( STDIN_FILENO, TCSANOW, &settings );

    struct pollfd pls[ 1 ];
    pls[ 0 ].fd     = STDIN_FILENO;
    pls[ 0 ].events = POLLIN | POLLPRI;
    poll( pls, 1, INFINITE );

    tcsetattr( STDIN_FILENO, TCSANOW, &initial_settings );
    }

#endif

//----------------------------------------------------------------------------
ostream& print_epoch_time( ostream& outs, epoch_time_t elapsed_time )
  {
  unsigned minutes    = (unsigned) elapsed_time        / 60;
  unsigned seconds    = (unsigned) elapsed_time        % 60;
  unsigned hundredths = (unsigned)(elapsed_time * 100) % 100;

  outs << "It took you "
       << setfill( '0' ) << minutes    << ":"
       << setw( 2 )      << seconds    << "."
       << setw( 2 )      << hundredths << endl;

  return outs;
  }

/* ////////////////////////////////////////////////////////////////////////////
  loading stuff
//////////////////////////////////////////////////////////////////////////// */

typedef vector <float>     container;
typedef deque  <container> container2;

//----------------------------------------------------------------------------
void load_at_once( const char* filename, string& result )
  {
  ifstream f( filename, ios::binary );
  result.assign( istream_iterator <char> ( f ), istream_iterator <char> () );
  }

//----------------------------------------------------------------------------
struct data_t
  {
  unsigned nrows;
  unsigned ncols;
  double   xllcorner;
  double   yllcorner;
  int      cellsize;
  double   nodatavalue;

  container2 data;
  };

void load_in_pieces( istream& f, data_t& data )
  {
  string s;
  f >> s >> data.ncols
    >> s >> data.nrows
    >> s >> data.xllcorner
    >> s >> data.yllcorner
    >> s >> data.cellsize
    >> s >> data.nodatavalue;
  data.data.resize( data.nrows );
  for (container2::iterator row = data.data.begin(); row != data.data.end(); ++row)
    {
    row->resize( data.ncols );
    for (container::iterator col = row->begin(); col != row->end(); ++col)
      {
      f >> *col;
      }
    }
  }

//----------------------------------------------------------------------------
int main()
  {
  epoch_time_t start, stop;
  string       raw_data;
  data_t       data;

  cout << "Please wait while loading...\n";
  start = get_epoch_time();
  load_at_once( "data.txt", raw_data );
  istringstream iss( raw_data );
  load_in_pieces( iss, data );
  stop = get_epoch_time();
  print_epoch_time( cout, stop - start );

  return 0;
  }

Upvotes: 2

jgreve
jgreve

Reputation: 1253

Not so much an answer as additional questions (these don't format well for a comment).

Observation: I think answering vector or something else better for memory storage is hard to say without knowing more about optimization potential... some of the data-related questions hit on that.

Questions about timing:

Can you modify your timing logic to read header values, then time the following scenarios:

1) Read-only, just pull each line into memory.   Disable the the grid
      parsing & assignment part.  Goal to benchmark raw reads on file.
      Also no "resize" operations on your "data" member.
2) Memory allocation (just read the headers and "resize" the "data" member;
   don't loop through remainder of file).
3) Everything (code as-posted).

Seems to me that reading a file under 1gb in size will be cached by the operating system.
So...
I'd encourage you to run the above 5 times or something and see if subsequent runs are are consistent. (If you don't check for that it might look like you get get a big speedup from a minor change to the code but actually it was just because the data file was cached by the os and your "gains" evaporate w/the next reboot).

Questions about the data file...

To paraphrase, the data file looks like a complete dump of every value in the grid.

Example: In addition to the 6 "header" lines, a 100 x 200 "grid" will have 100 lines in the file with 200 lines on each row. So 6 + 100*200 = 20,006 lines.

You mentioned a 900 MB data file.

I'll make some assumptions, just for the fun of it.
If your values are consistently formatted (e.g. "0.000000" thru "1.000000") that means 8 chars per value.


Assuming simple encoding (1 byte per character) then you'll fit something like a 10,000^2 grid in 900 MB.

Ignoring the header "lines" and end-of-line delims (which will just be rounding errors):
1kb has 1,024 chars
1mb has 1,048,576 chars (1024^2)
900mb has 943,718,400 chars (1024^2 * 900)
8 char element values yields 117,964,800 elements (1024^2*900/8)
Which is roughly a 10,000^2 grid (10,861 from sqrt(1024^2*900/8) ).

Is that even half-way close, size-wise?

What is the largest number grid elements you want to support? Millions? Billions? More?

Seems like any such "limit" is going to be driven by whatever produces your data files.

Questions about the data values


What kind of distribution do you see in your values?

What kind of range are you seeing in your values?

Are values always from 0.0000 - 1.0000 ? Or maybe any valid value for a C++ float?
Can you generate a frequency distribution of values?
e.g. count how many distinct values are in your Grid next time you build it?

This might help to find any values you can try to optimize around?

Perhaps many of the grid elements are zero, or 1, or...?

If so you might consider a sparse matrix approach with an interface that gives callers the illusion
of a fully instantiated grid (e.g. no need to allocate space).

I don't know what the break even point is for a "sparse matrix" approach to begin to pay off.
(Something I wouldn't consider unless I had a common value for at least 30% of my grid elements.)

If you're manipulating "raw" style images, the values will probably be all over the place.
If your values have a pattern, maybe you can exploit that.
I suppose that would be a kind of data compression.

Can you compute a mid-point value and just store offsets from that?
Can the value be computed from their grid coordinates? (I would guess No).

Another comment (@MaruisBancila) mentioned binary format vs. character format.
Makes good sense.

Going from 8 characters to 4 byte floats would cut the file size in half.

Questions about what people will do with a Grid

Once your Grid is populated, what will the "user" do with it?
Are they operating the entire grid?
Just a subset of it?
Can you "window" on your data to make the consumer(s) happy and only load a subset of grid values into memory?
(I would guess No, but never hurts to ask).

Ok, that's all I can think of for questions.

Upvotes: 3

Deshawn Dawkins
Deshawn Dawkins

Reputation: 31

With respect to your second question:

I would have chosen to use a one-dimensional vector, and then index it by (row*ncols+col).

This will at least reduce memory consumption, but it may also have a significant impact on speed.

I don't remember whether a 'vector of vectors' is an endorsed idiom by the standard, but there is a risk that too much copying and memory reallocation is going on, if there is no special handling of the 'vector of vectors' case.

Upvotes: 3

Related Questions