Ujan

Reputation: 103

Python implementation faster than C

I apologise if comparisons are not supposed to work this way. I'm new to programming and just curious as to why this is the case.

I have a large binary file containing word embeddings (4.5 GB). Each line has a word followed by its embedding, which consists of 300 float values. I simply want to count the total number of lines.

For C, I use mmap:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Assumed definition; the original snippet does not show handle_error */
#define handle_error(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

int main(void)
{
    int fd;
    struct stat sb;
    off_t offset = 0, pa_offset;
    size_t length, i;
    char *addr;
    int count = 0;

    fd = open("processed_data/crawl-300d-2M.vec", O_RDONLY);
    if (fd == -1)
        handle_error("open");

    if (fstat(fd, &sb) < 0)
        handle_error("fstat");

    /* The offset passed to mmap must be page-aligned */
    pa_offset = offset & ~(sysconf(_SC_PAGE_SIZE) - 1);
    if (offset >= sb.st_size) {
        fprintf(stderr, "offset is past end of file\n");
        exit(EXIT_FAILURE);
    }

    length = sb.st_size - offset;
    addr = mmap(0, (length + offset - pa_offset), PROT_READ, MAP_SHARED, fd, pa_offset);
    if (addr == MAP_FAILED)
        handle_error("mmap");

    //Timing only this loop
    clock_t begin = clock();
    for (i = 0; i < length; i++) {
        if (*(addr + i) == '\n') count++;
    }
    printf("%d\n", count);
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%f\n", time_spent);
    return 0;
}

This takes 11.283060 seconds.

Python:

import timeit

file = open('processed_data/crawl-300d-2M.vec', 'r')
count = 0
start_time = timeit.default_timer()
for line in file:  # iterating a file object yields one line at a time
    count += 1
print(count)
elapsed = timeit.default_timer() - start_time
print(elapsed)

This takes 3.0633065439997154 seconds.

Doesn't the Python code read each character to find new lines? If so, why is my C code so inefficient?

Upvotes: 1

Views: 257

Answers (1)

Serge Ballesta

Reputation: 148965

It is hard to say, because I assume the result is heavily implementation-dependent. But at first glance, the main difference between your Python and C programs is that the C program uses mmap. It is a very powerful tool (that you do not really need here...) and as such can come with some overhead. As the reference Python implementation is written in C, it is likely that the loop

for line in file:
    count += 1

ends up as a loop over a tiny function calling fgets. I would bet a coin that a naive C program using fgets would be slightly faster than the Python equivalent, because it avoids all the Python overhead. But IMHO it is no surprise that using mmap in C is less efficient than fgets in Python.

Upvotes: 3
