Reputation: 522
In a coding competition specified at this link there is a task where you need to read a lot of data on stdin, do some calculations, and present a lot of data on stdout.
In my benchmarking, almost all of the time is spent on I/O, even though I have tried to optimize it as much as possible.
The input is a string (1 <= len <= 100'000) and q rows of pairs of ints, where q is also bounded by 1 <= q <= 100'000.
I benchmarked my code on a 100 times larger dataset (len = 10M, q = 10M) and this is the result:
Activity         Time (s)  Accumulated (s)
Read text:       0.004     0.004
Read numbers:    0.146     0.150
Parse numbers:   0.200     0.350
Calc answers:    0.001     0.351
Format output:   0.037     0.388
Print output:    0.143     0.531
By implementing my own formatting and number parsing inline, I managed to get the time down to a third of what it was when using printf and scanf.
However, when I uploaded my solution to the competition's webpage, it took 1.88 seconds (I think that is the total time over 22 datasets). When I look at the high-score list, there are several implementations (in C++) that finished in 0.05 seconds, nearly 40 times faster than mine! How is that possible?
I guess that I could speed it up a bit by using 2 threads, so I can start calculating and writing to stdout while still reading from stdin. In a theoretical best case on my large dataset, however, that only cuts the time by min(0.150, 0.143) seconds. I'm still nowhere close to the high score.
The program gets compiled by the website with these options:
gcc -g -O2 -std=gnu99 -static my_file.c -lm
and timed like this:
time ./a.out < sample.in > sample.out
My code looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN (100000 + 2)              // text + '\n' + '\0'
#define ROW_LEN (6 + 1)                   // up to 6 digits + separator
#define DOUBLE_ROW_LEN (2 * ROW_LEN + 2)  // "100000 100000\n" + '\0' fits

int main(void)
{
    int ret = 1;

    // Set custom buffers for stdin and stdout
    char stdout_buf[16384];
    setvbuf(stdout, stdout_buf, _IOFBF, 16384);
    char stdin_buf[16384];
    setvbuf(stdin, stdin_buf, _IOFBF, 16384);

    // Read the text line from stdin into a buffer
    char *buf = malloc(MAX_LEN);
    if (!buf) {
        fprintf(stderr, "Failed to allocate buffer\n");
        return 1;
    }
    if (!fgets(buf, MAX_LEN, stdin))
        goto EXIT_A;
    buf[strcspn(buf, "\n")] = '\0';  // strip the trailing newline, if any

    // Get the number of tests
    int m;
    if (scanf("%d\n", &m) != 1 || m < 1)
        goto EXIT_A;

    char *num_buf = malloc(DOUBLE_ROW_LEN);
    int *start = calloc(m, sizeof(int));
    int *stop = calloc(m, sizeof(int));
    if (!num_buf || !start || !stop) {
        fprintf(stderr, "Failed to allocate buffers\n");
        goto EXIT_B;
    }

    // Parse each "start stop" row by hand: digits accumulate into *nn,
    // and the space switches nn from the start array to the stop array.
    int *staptr = start;
    int *stpptr = stop;
    for (int i = 0; i < m; i++) {
        if (!fgets(num_buf, DOUBLE_ROW_LEN, stdin))
            goto EXIT_B;
        int *nn = staptr++;
        char *cptr = num_buf - 1;
        while (*(++cptr) > '\n') {
            if (*cptr == ' ')
                nn = stpptr++;
            else
                *nn = *nn * 10 + *cptr - '0';
        }
    }

    // For each test, count how long the substrings at the two offsets agree
    char *buf_end = strchr(buf, '\0');
    char outbuf[ROW_LEN];
    for (int i = 0; i < m; i++) {
        char *ptr_l = buf + start[i];
        char *ptr_r = buf + stop[i];
        while (ptr_r < buf_end && *ptr_l == *ptr_r) {
            ++ptr_l;
            ++ptr_r;
        }

        // Format the match length by hand: count its digits first,
        // then emit them backwards from the end of outbuf.
        int len = (int)(ptr_l - (buf + start[i]));
        int shift = len;
        char *out = outbuf;
        do {
            out++;
            shift /= 10;
        } while (shift);
        *out = '\0';
        do {
            *(--out) = "0123456789"[len % 10];
            len /= 10;
        } while (len);
        puts(outbuf);
    }

    ret = 0;
EXIT_B:
    free(num_buf);
    free(start);
    free(stop);
EXIT_A:
    free(buf);
    return ret;
}
Upvotes: 13
Views: 2084
Reputation: 7837
Thanks to your question, I went and solved the problem myself. Your time is better than mine, but I'm still using some stdio functions.
I simply do not think the high score of 0.05 seconds is bona fide. I suspect it's the product of a highly automated system that returned that result in error, and that no one ever verified it.
How to defend that assertion? There's no real algorithmic complexity: the problem is O(n). The "trick" is to write specialized parsers for each aspect of the input (and avoid work done only in debug mode). The total time for 22 trials is 50 milliseconds, meaning each trial averages about 2.3 ms? We're down near the threshold of measurability.
Competitions like the problem you addressed yourself to are unfortunate, in a way. They reinforce the naive idea that performance is the ultimate measure of a program (there's no score for clarity). Worse, they encourage going around things like scanf "for performance" while, in real life, getting a program to run correctly and fast basically never entails avoiding or even tuning stdio. In a complex system, performance comes from things like avoiding I/O, passing over the data only once, and minimizing copies. Using the DBMS effectively is often key (as it were), but such things never show up in programming challenges.
Parsing and formatting numbers as text does take time, and in rare circumstances can be a bottleneck. But the answer is hardly ever to rewrite the parser. Rather, the answer is to parse the text into a convenient binary form, and use that. In short: compilation.
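For illustration, here is a minimal sketch of the kind of specialized parser meant above (the helper name and calling convention are mine, not from either solution): it reads one non-negative decimal integer straight out of an in-memory buffer, much like the question's inner loop.

/* Hypothetical helper: parse one non-negative decimal integer starting
   at p and return a pointer just past its one-byte separator, which in
   this problem's input is either ' ' or '\n'. */
static const char *parse_uint(const char *p, int *out)
{
    int v = 0;
    while (*p >= '0' && *p <= '9')
        v = v * 10 + (*p++ - '0');
    *out = v;
    return p + 1;
}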
That said, a few observations may help.
You don't need dynamic memory for this problem, and it's not helping. The problem statement says the input string may be up to 100,000 characters, and the number of trials may be as many as 100,000. Each trial is two integer strings of up to 6 digits each, separated by a space and terminated by a newline: 6 + 1 + 6 + 1 = 14 bytes. The maximum total input is therefore 100,000 + 1 + 6 + 1 + 100,000 * 14 bytes: under 2 MB. You are allowed 1 GB of memory.
I just allocated a single 2 MB buffer, and read it all in at once with read(2). Then I made a single pass over that input.
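In outline, that setup looks something like this (the buffer size and function name are my own choices, sized from the arithmetic above):

#include <unistd.h>   /* read(2) */

#define INPUT_MAX (2 * 1024 * 1024)   /* comfortably above the ~1.5 MB bound */

static char input[INPUT_MAX];         /* statically allocated, no malloc */

/* Pull all of stdin into input[]; returns the number of bytes read.
   A loop is still needed because read(2) may return short counts. */
static size_t slurp_stdin(void)
{
    size_t total = 0;
    ssize_t n;
    while ((n = read(STDIN_FILENO, input + total, INPUT_MAX - total)) > 0)
        total += (size_t)n;
    return total;
}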
You got suggestions to use asynchronous I/O and threads. The problem statement says you're measured on CPU time, so neither of those helps. The shortest distance between two points is a straight line; a single read into statically allocated memory wastes no motion.
One ridiculous aspect of the way they measure performance is that they compile with -g and, more to the point, without -DNDEBUG. That means assert(3) is invoked in code that is being measured for performance! I couldn't get under 4 seconds on test 22 until I removed my asserts.
In sum, you did pretty well, and I suspect the winner you're baffled by is a phantom. Your code does faff about a bit, and you can dispense with dynamic memory and tuning stdio. I bet your time can be trimmed by simplifying it. To the extent that performance matters, that's where I'd direct your attention.
Upvotes: 2
Reputation: 441
Answering this question is tricky because optimization heavily depends on the problem you have. One idea is to look at the content of the file you are trying to read and see if there are patterns or things that you can use in your favor. The code you wrote is a "general" solution for reading from a file, executing something, and then writing to a file. But if the file is not randomly generated each time and the content is always the same, why not write a solution tailored to that file?
On the other hand, you could try to use low-level system functions. One that comes to mind is mmap, which allows you to map a file directly into memory and access that memory instead of using scanf and fgets.
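A minimal sketch of that approach, assuming stdin is redirected from a regular file as it is in the competition's timing command (the line count at the end is only a stand-in for the real parsing):

#include <stdio.h>
#include <sys/mman.h>   /* mmap(2), munmap(2) */
#include <sys/stat.h>   /* fstat(2) */
#include <unistd.h>

int main(void)
{
    /* stdin comes from "./a.out < sample.in", i.e. a regular file,
       so we can ask for its size and map it read-only. */
    struct stat st;
    if (fstat(STDIN_FILENO, &st) != 0 || st.st_size == 0)
        return 1;

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE,
                      STDIN_FILENO, 0);
    if (data == MAP_FAILED)
        return 1;

    /* Parse directly out of data[0 .. st.st_size - 1]; as a stand-in,
       count the input lines. */
    long lines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n')
            lines++;
    printf("%ld\n", lines);

    munmap(data, st.st_size);
    return 0;
}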
Another thing I found that might help: your solution has two while loops, so why not try to use only one? You could also do some asynchronous I/O reading, so that instead of reading the whole file in one loop and then doing the calculation in another, you read a portion at the beginning, start processing it asynchronously, and continue reading.
This link might help for the async part
Upvotes: 0
Reputation: 91
You should allocate all your buffers contiguously. Allocate one buffer the size of all your buffers combined (num_buf, start, stop), then set the pointers to the corresponding offsets by their sizes. This can reduce your cache misses / page faults.
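For instance, a minimal sketch of that layout, reusing the names from the question (the helper and its signature are mine); the int arrays come first so their alignment is automatic:

#include <stdlib.h>

#define ROW_LEN (6 + 1)
#define DOUBLE_ROW_LEN (2 * ROW_LEN + 2)

/* Carve start, stop, and num_buf out of one contiguous block.
   The caller frees the single returned pointer to release all three. */
static char *alloc_block(int m, int **start, int **stop, char **num_buf)
{
    size_t arr_sz = (size_t)m * sizeof(int);
    char *block = calloc(1, 2 * arr_sz + DOUBLE_ROW_LEN);
    if (!block)
        return NULL;
    *start   = (int *)block;              /* first m ints            */
    *stop    = (int *)(block + arr_sz);   /* next m ints             */
    *num_buf = block + 2 * arr_sz;        /* line buffer at the tail */
    return block;
}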
Since the read and write operations seem to consume a lot of time, you should consider adding threads. One thread should deal with I/O and another with the computation. (It is worth checking whether a separate thread for printing could speed things up as well.) Make sure you don't use any locks while doing this.
Upvotes: 0