Reputation: 103
I apologise if comparisons are not supposed to work this way. I'm new to programming and just curious as to why this is the case.
I have a large binary file containing word embeddings (4.5 GB). Each line has a word followed by its embedding, which consists of 300 float values. I'm simply counting the total number of lines.
For C, I use mmap:
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Assumed helper (not shown in the original post): only prints the error. */
#define handle_error(msg) perror(msg)

int main(void)
{
    int fd;
    struct stat sb;
    off_t offset = 0, pa_offset;
    size_t length, i;
    char *addr;
    int count = 0;

    fd = open("processed_data/crawl-300d-2M.vec", O_RDONLY);
    if (fd == -1) {
        handle_error("open");
        exit(1);
    }
    if (fstat(fd, &sb) < 0) {
        handle_error("fstat");
        close(fd);
        exit(1);
    }
    /* Round the offset down to a page boundary, as mmap requires. */
    pa_offset = offset & ~(sysconf(_SC_PAGE_SIZE) - 1);
    if (offset >= sb.st_size) {
        fprintf(stderr, "offset is past end of file\n");
        exit(EXIT_FAILURE);
    }
    length = sb.st_size - offset;
    addr = mmap(NULL, (length + offset - pa_offset), PROT_READ, MAP_SHARED, fd, pa_offset);
    if (addr == MAP_FAILED) {
        handle_error("mmap");
        exit(1);
    }
    /* Timing only this loop */
    clock_t begin = clock();
    for (i = 0; i < length; i++) {
        if (*(addr + i) == '\n') count++;
    }
    printf("%d\n", count);
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%f\n", time_spent);
    return 0;
}
This takes 11.283060 seconds.
Python:
import timeit

file = open('processed_data/crawl-300d-2M.vec', 'r')
count = 0
start_time = timeit.default_timer()
for line in file:
    count += 1
print(count)
elapsed = timeit.default_timer() - start_time
print(elapsed)
This takes 3.0633065439997154 seconds.
Doesn't the Python code read each character to find new lines? If so, why is my C code so inefficient?
Upvotes: 1
Views: 257
Reputation: 148965
Hard to say, because I assume that it will be heavily implementation-dependent. But at first glance, the main difference between your Python and C programs is that the C program uses mmap. It is a very powerful tool (that you do not really need here) and as such can come with some overhead. As the reference Python implementation is written in C, it is likely that the loop
for line in file:
    count += 1
will end up as a loop over a tiny function calling fgets or an equivalent buffered read. I would bet a coin that a naive C program using fgets will be slightly faster than the Python equivalent, because it avoids all the Python overhead. But IMHO it is no surprise that using mmap in C is less efficient than fgets in Python.
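For reference, the naive fgets version I have in mind could look roughly like the sketch below. The function names count_lines and count_lines_in_file and the 64 KiB buffer size are my own choices for illustration, not anything from your program, and I have not timed it against your file. Like your mmap loop it counts '\n' characters, so a final line with no trailing newline is not counted (Python's loop would count it).

```c
#include <stdio.h>
#include <string.h>

/* Count newline-terminated lines in an already opened stream using fgets.
 * A line longer than the buffer is returned in several pieces; only the
 * last piece ends in '\n', so it is still counted exactly once. */
long count_lines(FILE *fp)
{
    char buf[65536];   /* hypothetical buffer size, chosen arbitrarily */
    long count = 0;

    while (fgets(buf, sizeof buf, fp) != NULL) {
        size_t len = strlen(buf);
        if (len > 0 && buf[len - 1] == '\n')
            count++;
    }
    return count;
}

/* Convenience wrapper: open the file at path, count its lines, close it.
 * Returns -1 if the file cannot be opened. */
long count_lines_in_file(const char *path)
{
    FILE *fp = fopen(path, "r");
    long count;

    if (fp == NULL)
        return -1;
    count = count_lines(fp);
    fclose(fp);
    return count;
}
```

You would call count_lines_in_file("processed_data/crawl-300d-2M.vec") between your two clock() calls and print the result after stopping the timer, so that the printf is not part of the measurement.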
Upvotes: 3