AcarX
AcarX

Reputation: 281

Working with big text files

I have a file in following format:

[1]
Parameter1=Value1
.
.
.
End
[2]
.
.

The number between brackets presents id of the entity. There're around 4500 entites. I need to parse through all entites and pick the ones matching my parameters and values. Size of file is around 20mb. My first approach was to reading file line by line and storing them in a struct array like:

struct Component{
    std::string parameter;
    std::string value;
};
struct Entity{
    std::string id;
    std::list<Component> components;
};
std::list<Entity> g_entities;

But this approach took enormous amount of memory and was very slow. I've also tried storing only the ones that match my parameters/values. But that also was really slow and took quite some memory. Ideally i would like to store all data in memory so that i won't have to load the file everytime i need to filter my parameters/values if it's possible with reasonable amount of memory usage.

Edit 1: I read file line by line:

            std::ifstream readTemp(filePath);
            std::stringstream dataStream;
            dataStream << readTemp.rdbuf();
            readTemp.close();

            while (std::getline(dataStream, line)){
                    if (line.find('[') != std::string::npos){
                        // Create Entity
                        Entity entity;

                        // Set entity id
                        entity.id = line.substr(line.find('[') + 1, line.find(']') - 1);

                        // Read all lines until EnumEnd=0
                        while (1){
                            std::getline(dataStream, line);
                            // Break loop if end of entity
                            if (line.find("EnumEnd=0") != std::string::npos){
                                if (CheckMatch(entity))
                                    entities.push_back(entity);
                                entity.components.clear();
                                break;
                            }


                            Component comp;
                            int pos_eq = line.find('='); 
                            comp.parameterId = line.substr(0, pos_eq);
                            comp.value = line.substr(pos_eq + 1);

                            entity.components.push_back(comp);
                        }
                    }
                }

Upvotes: 2

Views: 184

Answers (2)

Valentin H
Valentin H

Reputation: 7448

PS: After your edit. and Comment concerning memory consumption

500MB / 20MB = 25.

If each line is 25 chars long, the memory consumption looks ok.

OK you could use a look-up table for mapping parameter-names to numbers. If the names-set is small, this will save the consumption up to 2 times.

Your data structure could look like this:

std::map<int, std::map<int, std::string> > my_ini_file_data;
std::map<std::string, int> param_to_idx;

(provided the parameter names within sections (entities as you call it) are not unique)

Putting the data is:

std::string param = "Param";
std::string value = "Val";
int entity_id = 0;
if ( param_to_idx.find(param) == param_to_idx.end() )
  param_to_idx[param] = param_to_idx.size();
my_ini_file_data[entity_id][ param_to_idx[param] ] = value;

getting the data is:

    value = my_ini_file_data[entity_id][ param_to_idx[param] ];

If the values-set is also considerably smaller than the number of entries, you could even map values to numbers:

std::map<int, std::map<int, int> > my_ini_file_data;
std::map<std::string, int> param_to_idx;
std::map<std::string, int> value_to_idx;
std::map<int, std::string> idx_to_value;

Putting the data is:

std::string param = "Param";
std::string value = "Val";
int entity_id = 0;
if ( param_to_idx.find(param) == param_to_idx.end() )
      param_to_idx[param] = param_to_idx.size();
if ( value_to_idx.find(value) == value_to_idx.end() )
{  
      int idx = value_to_idx.size();
      value_to_idx[value] = idx;
      idx_to_value[idx] = value;
}

my_ini_file_data[entity_id][ param_to_idx[param] ] = value_to_idx[value];

getting the data is:

value = idx_to_value[my_ini_file_data[entity_id][ param_to_idx[param] ] ];

Hope, this helps.

Initial answer

Concerning memory, I wouldn't care unless you have a kind of embedded system with very small memory.

Concerning the speed, I could give you some suggestions:

Find out, what is the bottleneck.

  1. Use std::list! Using std::vector you re-initialize the memory each time the vector grows. If for some reason you need a vector at the end, create the vector reserving the requires number of entries, which you'll get by calling list::size()
  2. Write a while loop, there you only call getline. If this alone is already slow, read the entire block at once, create a reader-stream out of the char* block and read line by line from the stream.
  3. If the speed of the simple reading is OK, optimize your parsing code. You can reduce the number of find-calls by storing the position. e.g.

    int pos_begin = line.find('[]');
    if (pos_begin != std::string::npos){
        int pos_end = line.find(']');
        if (pos_end != std::string::npos){
            entity.id = line.substr(pos_begin + 1, pos_begin - 1);
    
            // Read all lines until EnumEnd=0
            while (1){
                std::getline(readTemp, line);
                // Break loop if end of entity
                if (line.find("EnumEnd=0") != std::string::npos){
                    if (CheckMatch(entity))
                        entities.push_back(entity);
                    break;
                }
    
    
                Component comp;
                int pos_eq = line.find('=');
    
                comp.parameter= line.substr(0, pos_eq);
                comp.value = line.substr(pos_eq + 1);
    
                entity.components.push_back(comp);
            }
        }
    }
    
    1. Depending on how big your entities are, check if CheckMatch is slow. The smaller the entities, the slower the code - in this case.

Upvotes: 3

Vlad
Vlad

Reputation: 18633

You can use less memory by interning your params and values, so as not to store multiple copies of them.

You could have a map of strings to unique numeric IDs, that you create when loading the file, and then just use the IDs when querying your data structure. At the expense of possibly slower parsing initially, working with these structures afterwards should be faster, as you'd only be matching 32-bit integers rather than comparing strings.

Sketchy proof of concept for storing each string once:

#include <unordered_map>
#include <string>
#include <iostream>

using namespace std;

int string_id(const string& s) {
  static unordered_map<string, int> m;
  static int id = 0;

  auto it = m.find(s);
  if (it == m.end()) {
    m[s] = ++id;
    return id;
  } else {
    return it->second;
  }
}

int main() {
  // prints 1 2 2 1
  cout << string_id("hello") << " ";
  cout << string_id("world") << " "; 
  cout << string_id("world") << " ";
  cout << string_id("hello") << endl; 
}

The unordered_map will end up storing each string once, so you're set for memory. Depending on your matching function, you can define

struct Component {
    int parameter;
    int value;
};

and then your matching can be something like myComponent.parameter == string_id("some_key") or even myComponent.parameter == some_stored_string_id. If you want your strings back, you'll need a reverse mapping as well.

Upvotes: 2

Related Questions