Reputation: 117
I have about 25 million integers, one per line, in a text file. My first task is to sort them. I have managed to read the integers and put them into an array (since my sorting function takes an unsorted array as an argument). However, reading the integers from the file is a very slow and expensive process. I have searched for cheaper, more efficient ways of doing this, but I was not able to find one that handles input of this size. So, what would you suggest for reading the integers from a huge (about 260MB) text file? And also, how can I efficiently get the number of lines for the same problem?
#include <fstream>
#include <cstdlib>

int main()
{
    std::ifstream myFile("input.txt");
    int currentNumber;
    int nItems = 25000000;
    // Preallocate space for all the integers up front.
    int *arr = (int*) malloc(nItems * sizeof(*arr));
    int i = 0;
    while (myFile >> currentNumber)
    {
        arr[i++] = currentNumber;
    }
}
This is just how I get the integers from the text file. It is not that complicated. I assumed the number of lines is fixed (and in fact it is).
By the way, it is not too slow, of course. It completes reading in approximately 9 seconds on OS X with a 2.2GHz i7 processor. But I feel it could be much better.
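On the side question of counting lines, one cheap approach (a sketch of my own, not from the question) is to count newline bytes straight through the stream buffer instead of extracting integers; it assumes each integer is terminated by a newline:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>

int main()
{
    std::ifstream myFile("input.txt");
    // Count '\n' bytes directly from the stream buffer; no parsing is done.
    long long lines = std::count(std::istreambuf_iterator<char>(myFile),
                                 std::istreambuf_iterator<char>(), '\n');
    std::cout << "lines: " << lines << '\n';
}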
Upvotes: 5
Views: 6157
Reputation: 7798
One possible solution would be to divide the large file into smaller chunks, sort each chunk separately, and then merge the sorted chunks one by one.
EDIT: Apparently this is a well-established method. See 'External merge sort' at http://en.wikipedia.org/wiki/External_sorting
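Sketching the idea in code (my own illustration, not from the answer; the chunk size and file names are arbitrary assumptions): sort each chunk in memory, write it out as a sorted run, then merge the runs with a min-heap.

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sort the current chunk and write it to disk as one sorted run.
static void writeRun(std::vector<int> &chunk, std::vector<std::string> &runs)
{
    std::sort(chunk.begin(), chunk.end());
    runs.push_back("run" + std::to_string(runs.size()) + ".txt"); // hypothetical file names
    std::ofstream out(runs.back());
    for (std::size_t j = 0; j < chunk.size(); ++j)
        out << chunk[j] << '\n';
    chunk.clear();
}

int main()
{
    const std::size_t chunkSize = 5000000; // how many ints to hold in memory at once (assumption)
    std::ifstream in("input.txt");
    std::vector<std::string> runs;
    std::vector<int> chunk;
    int value;

    // Phase 1: sort fixed-size chunks in memory, writing each out as a run.
    while (in >> value)
    {
        chunk.push_back(value);
        if (chunk.size() == chunkSize)
            writeRun(chunk, runs);
    }
    if (!chunk.empty())
        writeRun(chunk, runs);

    // Phase 2: k-way merge of the runs via a min-heap of (value, run index).
    std::vector<std::ifstream> streams(runs.size());
    typedef std::pair<int, std::size_t> Item;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > heap;
    for (std::size_t i = 0; i < runs.size(); ++i)
    {
        streams[i].open(runs[i].c_str());
        if (streams[i] >> value)
            heap.push(Item(value, i));
    }

    std::ofstream out("sorted.txt");
    while (!heap.empty())
    {
        Item top = heap.top();
        heap.pop();
        out << top.first << '\n';
        // Refill the heap from whichever run supplied the smallest value.
        if (streams[top.second] >> value)
            heap.push(Item(value, top.second));
    }
}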
Upvotes: 0
Reputation: 154037
You don't say how you are reading the values, so it's hard to say. Still, there are really only two solutions: `someIStream >> anInt` and `fscanf( someFd, "%d", &anInt )`. Logically, these should have similar performance, but implementations vary; it might be worth trying and measuring both.
Another thing to check is how you're storing them. If you know you have about 25 million, doing a reserve of 30 million on the std::vector before reading them would probably help. It might also be cheaper to construct the vector with 30 million elements, then trim it when you've seen the end, rather than using push_back.
Finally, you might consider writing a mmapstreambuf, and using that to mmap the input and read it directly from the mapped memory. Or even iterating over it manually, calling strtol (but that's a lot more work); all of the streaming solutions probably end up calling strtol, or something similar, but doing significant work around the call first.
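The mmapstreambuf itself isn't shown in the answer; a minimal sketch of the idea (my own, POSIX-only, error handling omitted) makes the entire mapped file the streambuf's get area, so extraction reads straight from the mapping:

#include <cstddef>
#include <iostream>
#include <istream>
#include <streambuf>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// A read-only streambuf whose get area is the mmap'ed file itself,
// so operator>> pulls characters directly from the mapping.
class mmapstreambuf : public std::streambuf
{
public:
    explicit mmapstreambuf(const char *path)
    {
        fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        len = st.st_size;
        char *base = static_cast<char *>(mmap(0, len, PROT_READ, MAP_PRIVATE, fd, 0));
        setg(base, base, base + len); // the whole file is the get area
    }
    ~mmapstreambuf()
    {
        munmap(eback(), len);
        close(fd);
    }
private:
    int fd;
    std::size_t len;
};

int main()
{
    mmapstreambuf buf("input.txt");
    std::istream in(&buf);
    int n, count = 0;
    while (in >> n)
        ++count;
    std::cout << count << '\n';
}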
FWIW, I did some very quick tests on my home machine (a fairly recent Lenovo, running Linux), and the results surprised me:
As a reference, I did the trivial, naïve implementation, using std::cin >> tmp and v.push_back( tmp ), with no attempts to optimize. On my system, this ran in just under 10 seconds.
Simple optimizations, such as using reserve on the vector, or initially creating the vector with a size of 25000000, didn't change much; the time was still over 9 seconds.
Using a very simple mmapstreambuf, the time dropped to around 3 seconds, with the simplest loop, no reserve, etc.
Using fscanf, the time dropped to just under 3 seconds. I suspect that the Linux implementation of FILE* also uses mmap (and std::filebuf doesn't).
Finally, using an mmap'ed buffer, iterating with two char*, and using strtol to convert, the time dropped to under a second. These tests were done very quickly (less than an hour to write and run all of them), and are far from rigorous (and of course, they don't tell you anything about other environments), but the differences surprised me. I didn't expect as much difference.
Upvotes: 0
Reputation: 129524
Any optimisation of this is likely to have rather little effect. On my machine, the limiting factor for reading large files is the disk transfer speed. Yes, improving the read speed can help a little, but most likely you won't gain very much from it.
I found in a previous test [I'll see if I can find the answer with that in it - I couldn't find the source in my "experiment code for SO" directory] that the fastest way is to load the file using mmap. But it's only marginally faster than using ifstream.
Edit: my home-made benchmark for reading a file in a few different ways: getline while reading a file vs reading whole file and then splitting based on newline character
As per usual, benchmarks measure what the benchmark measures, and small changes to either the environment or the way the code is written can sometimes make a big difference.
Edit: Here are a few implementations of "read a number from a file and store it in a vector":
#include <iostream>
#include <fstream>
#include <vector>
#include <sys/time.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h> // for lseek() and close()
using namespace std;
const char *file_name = "lots_of_numbers.txt";
// func1: naive ifstream >> int into a std::vector, no reserve.
void func1()
{
vector<int> v;
int num;
ifstream fin(file_name);
while( fin >> num )
{
v.push_back(num);
}
cout << "Number of values read " << v.size() << endl;
}
// func2: as func1, but reserve the vector's full capacity up front.
void func2()
{
vector<int> v;
v.reserve(42336000);
int num;
ifstream fin(file_name);
while( fin >> num )
{
v.push_back(num);
}
cout << "Number of values read " << v.size() << endl;
}
// func3: ifstream >> int into a preallocated raw array.
void func3()
{
int *v = new int[42336000];
int num;
ifstream fin(file_name);
int i = 0;
while( fin >> num )
{
v[i++] = num;
}
cout << "Number of values read " << i << endl;
delete [] v;
}
// func4: fscanf("%d") into a preallocated raw array.
void func4()
{
int *v = new int[42336000];
FILE *f = fopen(file_name, "r");
int num;
int i = 0;
while(fscanf(f, "%d", &num) == 1)
{
v[i++] = num;
}
cout << "Number of values read " << i << endl;
fclose(f);
delete [] v;
}
// func5: read 8KB blocks with ifstream::read and parse the digits by hand.
void func5()
{
int *v = new int[42336000];
int num = 0;
ifstream fin(file_name);
char buffer[8192];
memset(buffer, 26, sizeof(buffer)); // Pre-fill so a short first read also hits the end marker.
int i = 0;
int bytes = 0;
char *p;
int hasnum = 0;
int eof = 0;
while(!eof)
{
fin.read(buffer, sizeof(buffer));
p = buffer;
bytes = 8192;
while(bytes > 0)
{
if (*p == 26) // End of file marker...
{
eof = 1;
break;
}
if (*p == '\n' || *p == ' ')
{
if (hasnum)
v[i++] = num;
num = 0;
p++;
bytes--;
hasnum = 0;
}
else if (*p >= '0' && *p <= '9')
{
hasnum = 1;
num *= 10;
num += *p-'0';
p++;
bytes--;
}
else
{
cout << "Error..." << endl;
exit(1);
}
}
memset(buffer, 26, sizeof(buffer)); // To detect end of files.
}
cout << "Number of values read " << i << endl;
delete [] v;
}
// func6: as func5, but read the blocks with fread.
void func6()
{
int *v = new int[42336000];
int num = 0;
FILE *f = fopen(file_name, "r");
char buffer[8192];
memset(buffer, 26, sizeof(buffer)); // Pre-fill so a short first read also hits the end marker.
int i = 0;
int bytes = 0;
char *p;
int hasnum = 0;
int eof = 0;
while(!eof)
{
fread(buffer, 1, sizeof(buffer), f);
p = buffer;
bytes = 8192;
while(bytes > 0)
{
if (*p == 26) // End of file marker...
{
eof = 1;
break;
}
if (*p == '\n' || *p == ' ')
{
if (hasnum)
v[i++] = num;
num = 0;
p++;
bytes--;
hasnum = 0;
}
else if (*p >= '0' && *p <= '9')
{
hasnum = 1;
num *= 10;
num += *p-'0';
p++;
bytes--;
}
else
{
cout << "Error..." << endl;
exit(1);
}
}
memset(buffer, 26, sizeof(buffer)); // To detect end of files.
}
fclose(f);
cout << "Number of values read " << i << endl;
delete [] v;
}
// func7: fgetc one character at a time, parsing the digits by hand.
void func7()
{
int *v = new int[42336000];
int num = 0;
FILE *f = fopen(file_name, "r");
int ch;
int i = 0;
int hasnum = 0;
while((ch = fgetc(f)) != EOF)
{
if (ch == '\n' || ch == ' ')
{
if (hasnum)
v[i++] = num;
num = 0;
hasnum = 0;
}
else if (ch >= '0' && ch <= '9')
{
hasnum = 1;
num *= 10;
num += ch-'0';
}
else
{
cout << "Error..." << endl;
exit(1);
}
}
fclose(f);
cout << "Number of values read " << i << endl;
delete [] v;
}
// func8: mmap the whole file and parse the digits straight from the mapping.
void func8()
{
int *v = new int[42336000];
int num = 0;
int f = open(file_name, O_RDONLY);
off_t size = lseek(f, 0, SEEK_END);
char *buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);
int i = 0;
int hasnum = 0;
int bytes = size;
char *p = buffer;
while(bytes > 0)
{
if (*p == '\n' || *p == ' ')
{
if (hasnum)
v[i++] = num;
num = 0;
p++;
bytes--;
hasnum = 0;
}
else if (*p >= '0' && *p <= '9')
{
hasnum = 1;
num *= 10;
num += *p-'0';
p++;
bytes--;
}
else
{
cout << "Error..." << endl;
exit(1);
}
}
close(f);
munmap(buffer, size);
cout << "Number of values read " << i << endl;
delete [] v;
}
struct bm
{
void (*f)();
const char *name;
};
#define BM(f) { f, #f }
bm b[] =
{
BM(func1),
BM(func2),
BM(func3),
BM(func4),
BM(func5),
BM(func6),
BM(func7),
BM(func8),
};
// Convert a timeval to milliseconds.
double time_to_double(timeval *t)
{
return (t->tv_sec + (t->tv_usec/1000000.0)) * 1000.0;
}
double time_diff(timeval *t1, timeval *t2)
{
return time_to_double(t2) - time_to_double(t1);
}
int main()
{
for(int i = 0; i < sizeof(b) / sizeof(b[0]); i++)
{
timeval t1, t2;
gettimeofday(&t1, NULL);
b[i].f();
gettimeofday(&t2, NULL);
cout << b[i].name << ": " << time_diff(&t1, &t2) << "ms" << endl;
}
for(int i = sizeof(b) / sizeof(b[0])-1; i >= 0; i--)
{
timeval t1, t2;
gettimeofday(&t1, NULL);
b[i].f();
gettimeofday(&t2, NULL);
cout << b[i].name << ": " << time_diff(&t1, &t2) << "ms" << endl;
}
}
Results (two consecutive runs, forwards and backwards to avoid file-caching benefits):
Number of values read 42336000
func1: 6068.53ms
Number of values read 42336000
func2: 6421.47ms
Number of values read 42336000
func3: 5756.63ms
Number of values read 42336000
func4: 6947.56ms
Number of values read 42336000
func5: 941.081ms
Number of values read 42336000
func6: 962.831ms
Number of values read 42336000
func7: 2572.4ms
Number of values read 42336000
func8: 816.59ms
Number of values read 42336000
func8: 815.528ms
Number of values read 42336000
func7: 2578.6ms
Number of values read 42336000
func6: 948.185ms
Number of values read 42336000
func5: 932.139ms
Number of values read 42336000
func4: 6988.8ms
Number of values read 42336000
func3: 5750.03ms
Number of values read 42336000
func2: 6380.36ms
Number of values read 42336000
func1: 6050.45ms
In summary, as someone pointed out in the comments, the actual parsing of the integers is quite a substantial part of the whole time, so reading the file isn't quite as critical as I first made out. Even a very naive way of reading the file (using fgetc()) beats the ifstream operator>> for integers.
As can be seen, using mmap to load the file is slightly faster than reading the file via fstream, but only marginally so.
Upvotes: 8
Reputation: 49329
It will be pretty straightforward with Qt:
QFile file("h:/1.txt");
file.open(QIODevice::ReadOnly);
QDataStream in(&file);
QVector<int> ints;
ints.reserve(25000000);
while (!in.atEnd()) {
int integer;
qint8 line;
in >> integer >> line; // read an int into integer, a char into line
ints.append(integer); // append the integer to the vector
}
At the end, you have the ints QVector, which you can easily sort. The number of lines is the same as the size of the vector, provided the file was properly formatted.
On my machine (an i7 3770k @ 4.2 GHz), it takes about 490 milliseconds to read 25 million ints and put them into a vector, reading from a regular mechanical HDD, not an SSD.
Buffering the entire file into memory didn't help all that much; the time dropped only to 420 msec.
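For reference, "buffering the entire file" could look like this with the same QDataStream approach (a sketch of my own, not the answerer's actual code): QFile::readAll pulls everything into a QByteArray, and the stream then operates on that in-memory buffer.

QFile file("h:/1.txt");
file.open(QIODevice::ReadOnly);
QByteArray data = file.readAll(); // the whole file in memory
QDataStream in(data);             // read-only stream over the buffer
QVector<int> ints;
ints.reserve(25000000);
while (!in.atEnd()) {
    int integer;
    qint8 line;
    in >> integer >> line; // same record layout as above
    ints.append(integer);
}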
Upvotes: 0
Reputation: 23
You can use external sorting to sort the values in your file without loading them all into memory. Sorting speed will be limited by your hard drive's throughput, but you will be able to handle really huge files. Here is the implementation.
Upvotes: 2
Reputation: 2629
I would do it this way:
#include <fstream>
#include <iostream>
#include <string>
#include <exception> // for std::exception
#include <cstdlib>   // for system()
using namespace std;
int main() {
fstream file;
string line;
int intValue;
int lineCount = 0;
try {
file.open("myFile.txt", ios_base::in); // Open to read
while(getline(file, line)) {
lineCount++;
try {
intValue = stoi(line);
// Do something with your value
cout << "Value for line " << lineCount << " : " << intValue << endl;
} catch (const exception& e) {
cerr << "Failed to convert line " << lineCount << " to an int : " << e.what() << endl;
}
}
} catch (const exception& e) {
cerr << e.what() << endl;
if (file.is_open()) {
file.close();
}
}
cout << "Line count : " << lineCount << endl;
system("PAUSE");
}
Upvotes: 0
Reputation:
260MB is not that big. You should be able to load the whole thing into memory and then parse through it. Once it's in, you can use a nested loop to read the integers between line endings and convert them using the usual functions. I'd try to preallocate sufficient memory for your array of integers before you start.
Oh, and you may find the crude old C-style file access functions are the faster option for things like this.
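A minimal sketch along those lines (my own illustration, error handling omitted): slurp the file into one buffer with C stdio, then walk it with strtol, which conveniently skips the intervening newlines.

#include <cstdio>
#include <cstdlib>
#include <vector>

int main()
{
    FILE *f = fopen("input.txt", "rb");
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    // Pull the whole ~260MB file into one buffer.
    std::vector<char> buffer(size + 1);
    fread(buffer.data(), 1, size, f);
    buffer[size] = '\0'; // terminate so strtol stops at the end
    fclose(f);

    // Preallocate, then convert number after number with strtol.
    std::vector<int> values;
    values.reserve(25000000);
    char *p = buffer.data();
    while (true)
    {
        char *end;
        long v = strtol(p, &end, 10);
        if (end == p)
            break; // no more digits
        values.push_back((int)v);
        p = end;
    }
    return 0;
}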
Upvotes: 0
Reputation: 718
Try reading blocks of integers and parsing those blocks instead of reading line by line.
Upvotes: 0