duong_dajgja
duong_dajgja

Reputation: 4276

Efficiently read data from a structured file in C/C++

I have a file as follows:

enter image description here

The file consists of 2 parts: header and data.

The data part is separated into equally sized pages. Each page holds data for a specific metric. Multiple pages (needs not to be consecutive) might be needed to hold data for a single metric. Each page consists of a page header and a page body. A page header has a field called "Next page" that is the index of the next page that holds data for the same metric. A page body holds real data. All pages have the same & fixed size (20 bytes for header and 800 bytes for body (if data amount is less than 800 bytes, 0 will be filled)).

The header part consists of 20,000 elements, each element has information about a specific metric (point 1 -> point 20000). An element has a field called "first page" that is actually index of the first page holding data for the metric.

The file can be up to 10 GB.

Requirement: Re-order data of the file in the shortest time, that is, pages holding data for a single metric must be consecutive, and from metric 1 to metric 20000 according to alphabet order (header part must be updated accordingly).

An apparent approach: For each metric, read all data for the metric (page by page), write data to new file. But this takes much time, especially when reading data from the file.

Is there any efficient ways?

Upvotes: 4

Views: 472

Answers (3)

duong_dajgja
duong_dajgja

Reputation: 4276

I first read header part, then sort metrics in alphabetic order. For each metric in the sorted list I read all data from the input file and write to the output file. To remove bottlenecks at reading data step, I used memory mapping. The results showed that when using memory mapping the execution time for an input file of 5 GB was reduced 5 ~ 6 times compared with when not using memory mapping. This way temporarily solve my problems. However, I will also consider suggestions of @utnapistim.

Upvotes: 0

utnapistim
utnapistim

Reputation: 27365

An apparent approach: For each metric, read all data for the metric (page by page), write data to new file. But this takes much time, especially when reading data from the file.

Is there any efficient ways?

Yes. After you get a working solution, measure it's efficiency, then decide which parts you wish to optimize. What and how you optimize will depend greatly on what results you get here (what are your bottlenecks).

A few generic things to consider:

  • if you have one set of steps that read data for a single metric and move it to the output, you should be able to parallelize that (have 20 sets of steps instead of one).
  • a 10Gb file will take a bit to process regardless of what hardware you run your code on (concievably, you could run it on a supercomputer but I am ignoring that case). You / your client may accept a slower solution if it displays it's progress / shows a progress bar.
  • do not use string comparisons for sorting;

Edit (addressing comment)

Consider performing the read as follows:

  • create a list of block offset for the blocks you want to read

  • create a list of worker threads, of fixed size (for example, 10 workers)

  • each idle worker will receive the file name and a block offset, then create a std::ifstream instance on the file, read the block, and return it to a receiving object (and then, request another block number, if any are left).

  • read pages should be passed to a central structure that manages/stores pages.

Also consider managing the memory for the blocks separately (for example, allocate chunks of multiple blocks preemptively, when you know the number of blocks to be read).

Upvotes: 2

Some programmer dude
Some programmer dude

Reputation: 409136

One possible solution is to create an index from the file, containing the page number and the page metric that you need to sort on. Create this index as an array, so that the first entry (index 0) corresponds to the first page, the second entry (index 1) the second page, etc.

Then you sort the index using the metric specified.

When sorted, you end up with a new array which contains a new first, second etc. entries, and you read the input file writing to the output file in the order of the sorted index.

Upvotes: 3

Related Questions