Python implementation faster than C

Question

I apologise if comparisons are not supposed to work this way. I'm new to programming and just curious as to why this is the case.

I have a large binary file containing word embeddings (4.5 GB). Each line holds a word followed by its embedding, which consists of 300 float values. I simply want to count the total number of lines.

For C, I use mmap:

int fd; 
struct stat sb; 
off_t offset = 0, pa_offset;
size_t length, i;
char *addr;
int count = 0;

fd = open("processed_data/crawl-300d-2M.vec", O_RDONLY);
if(fd == -1){
    handle_error("open");
    exit(1);
}

if(fstat(fd, &sb) < 0){
    handle_error("fstat");
    close(fd);
    exit(1);
}

pa_offset = offset & ~(sysconf(_SC_PAGE_SIZE) - 1);
if(offset >= sb.st_size){
    fprintf(stderr, "offset is past end of file\n");
    exit(EXIT_FAILURE);
}

length = sb.st_size - offset;
addr = mmap(0, (length + offset - pa_offset), PROT_READ, MAP_SHARED, fd, pa_offset);
if (addr == MAP_FAILED) handle_error("mmap");

//Timing only this loop
clock_t begin = clock();
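for(i=0;i<length;i++){
    // Loop body was cut off in the source; counting newlines is assumed from the description above
    if(addr[i] == '\n') count++;
}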

This takes 11.283060 seconds.

Python:

import timeit

file = open('processed_data/crawl-300d-2M.vec', 'r')
count = 0
start_time = timeit.default_timer()
for line in file:
    count += 1
print(count)
elapsed = timeit.default_timer() - start_time
print(elapsed)


This takes 3.0633065439997154 seconds.

Doesn't the Python code read each character to find new lines? If so, why is my C code so inefficient?
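
On the question of whether Python looks at every character: in CPython it does, but as far as I understand it, `for line in file` reads the file in buffered chunks and searches for line endings inside the interpreter's compiled C code, so the per-character work never runs as Python bytecode. The sketch below makes that explicit by reading large binary chunks and counting newline bytes with `bytes.count`. The names `count_lines_chunked` and `make_sample_file`, the 1 MiB chunk size, and the small generated file are assumptions for illustration only; they are not part of my original code, and this does not by itself explain the C timing.

```python
import os
import tempfile


def count_lines_chunked(path, chunk_size=1 << 20):
    """Count newline bytes by reading the file in large binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of file
                break
            # bytes.count scans the whole chunk in compiled code
            count += chunk.count(b"\n")
    return count


def make_sample_file(num_lines=1000, dims=300):
    """Write a small hypothetical stand-in for the embeddings file."""
    fd, path = tempfile.mkstemp(suffix=".vec")
    with os.fdopen(fd, "w") as f:
        for i in range(num_lines):
            values = " ".join("0.5" for _ in range(dims))
            f.write(f"word{i} {values}\n")
    return path


if __name__ == "__main__":
    sample = make_sample_file()
    try:
        print(count_lines_chunked(sample))
    finally:
        os.remove(sample)
```

Pointing `count_lines_chunked` at the real file and timing it with `timeit.default_timer()` would give a number to compare against both the `for line in file` loop and the C loop. Note that it counts newline bytes, so a final line without a trailing newline is not counted.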
