Patrick Bauer

Reputation: 21

Sequential, subsequent loading of files gets much slower over time

I've got the following code to read and process multiple very big files one after another.

for(j = 0; j < CORES; ++j) {
    double time = omp_get_wtime();
    printf("File: %d, time: %f\n", j, time);

    char in[256];
    sprintf(in, "%s.%d", FIN, j);

    FILE* f = fopen(in, "r");
    if (f == NULL) {
        fprintf(stderr, "open failed: %s\n", in);
        exit(1);
    }
    int i;
    char buffer[1024];
    char* tweet;
    int takeTime = 1;
    /* each file's TNUM records go into their own slice of the global TWEETS array */
    for (i = 0, tweet = TWEETS + (size_t)j*(size_t)TNUM*(size_t)TSIZE; i < TNUM; i++, tweet += TSIZE) {
        double start;
        double end;
        if(takeTime) {
            start = omp_get_wtime();
            takeTime = 0;
        }
        char* line = fgets(buffer, 1024, f);
        if (line == NULL) {
            fprintf(stderr, "error reading line %d\n", i);
            exit(2);
        }
        int fn = readNumber(&line);
        int ln = readNumber(&line);
        int month = readMonth(&line);
        int day = readNumber(&line);
        int hits = countHits(line, key);
        writeTweet(tweet, fn, ln, hits, month, day, line);

        if(i%1000000 == 0) {
            end = omp_get_wtime();
            printf("Line: %d, Time: %f\n", i, end-start);
            takeTime = 1;
        }
    }
    fclose(f);
}

Every file contains 24,000,000 tweets and I read 8 files in total, one after another. Each line (one tweet) gets processed, and writeTweet() copies a modified line into one really big char array.

As you can see, I measure how long it takes to read and process 1 million tweets. For the first file it's about 0.5 seconds per million, which is fast enough. But with every additional file it takes longer and longer: file 2 takes about 1 second per million lines (though not on every iteration, only some of them), up to 8 seconds on file number 8. Is this to be expected? Can I speed things up? All the files are more or less the same, each with 24 million lines.

Edit: Additional information: every file needs about 730 MB of RAM in processed form, so with 8 files we end up needing about 6 GB of memory in total.

As requested, here is the content of writeTweet():

void writeTweet(char* tweet, const int fn, const int ln, const int hits, const int month, const int day, char* line) {
    /* pack the header fields into the first 9 bytes of the fixed-size record */
    short* ptr1 = (short*) tweet;
    *ptr1 = (short) fn;
    int* ptr2 = (int*) (tweet + 2);
    *ptr2 = ln;
    *(tweet + 6) = (char) hits;
    *(tweet + 7) = (char) month;
    *(tweet + 8) = (char) day;

    int i;
    int n = TSIZE - 9; /* payload size: record size minus the 9-byte header */

    /* pad the line out to the full payload width with spaces */
    for (i = strlen(line); i < n; i++)
        line[i] = ' ';

    memcpy(tweet + 9, line, n);
}

Upvotes: 2

Views: 65

Answers (1)

injecto

Reputation: 879

Probably, writeTweet() is the bottleneck. If you keep every processed tweet in memory, you build up a huge data array that the operating system has to manage over time. If there is not enough free memory, or other processes on the system are actively using it, the OS will in most cases page part of that data out to disk, which increases the time it takes to access the array. There are other mechanisms in the OS, hidden from the user, that can affect performance as well.

You shouldn't store all the processed lines in memory. The simplest approach is to dump the processed tweets to disk (write them to a file). The right solution, however, depends on how you use the processed tweets afterwards. If you don't access the data in the array sequentially, it is worth thinking about a dedicated storage data structure (B-trees?). There are already many libraries for this purpose; better to look for one of them.
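As a rough illustration of the "write to a file" approach (a sketch only, not code from the question or the answer): each processed record could be streamed to an already-opened output FILE* instead of being copied into the big TWEETS array. TSIZE is the record size from the question; the function name writeTweetToFile, the output handle and the error handling are assumptions made for this example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void writeTweetToFile(FILE* out, const int fn, const int ln, const int hits,
                      const int month, const int day, const char* line) {
    char record[TSIZE];                  /* assumes TSIZE is a compile-time constant */
    memset(record, ' ', sizeof record);  /* pad with spaces, like writeTweet() does  */

    /* same 9-byte header layout as writeTweet(), written via memcpy */
    short sfn = (short) fn;
    memcpy(record,     &sfn, sizeof sfn);
    memcpy(record + 2, &ln,  sizeof ln);
    record[6] = (char) hits;
    record[7] = (char) month;
    record[8] = (char) day;

    size_t len = strlen(line);
    if (len > (size_t)(TSIZE - 9))
        len = TSIZE - 9;
    memcpy(record + 9, line, len);

    if (fwrite(record, 1, sizeof record, out) != sizeof record) {
        fprintf(stderr, "write failed\n");
        exit(3);
    }
}

The caller would fopen() one output file (an illustrative name like "%s.out.%d") before the inner loop and fclose() it afterwards; the fixed-size records can later be read back sequentially with fread() in TSIZE-byte chunks.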

UPD:

Modern OSes (including Linux) use a virtual memory model. To support it, the kernel's memory manager maintains special structures that map virtual pages to real pages in physical memory. These are usually multi-level page tables, where large memory volumes are referenced through sub-tables, so the whole structure is quite large and branched. When working with a big chunk of memory, pages are often accessed essentially at random, and to speed up address translation the hardware uses a special cache. I don't know all the subtleties of this process, but I think that in this case the cache has to be invalidated frequently, because there is no room to keep all the mappings cached at the same time. That is an expensive operation and it reduces performance, and the more memory you use, the worse it gets.

If you need to sort the large tweets array, you don't have to keep everything in memory: there are ways to sort data on disk. And if you do want to sort the data in memory, you don't need to actually swap the array elements themselves. It's better to use an intermediate structure with references to the elements of the tweets array, and to sort the references instead of the data.
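To make that last point concrete, here is a minimal sketch of sorting an index of pointers instead of the TSIZE-byte records themselves. It assumes the record layout from writeTweet() above (an int at byte offset 2); the choice of sort key, the function names and the comparator are illustrative only.

#include <stdlib.h>
#include <string.h>

/* compare two records through their pointers; the key here is the int
   stored at offset 2 (the 'ln' field written by writeTweet()) */
static int cmpByLn(const void* a, const void* b) {
    const char* ta = *(const char* const*) a;
    const char* tb = *(const char* const*) b;
    int la, lb;
    memcpy(&la, ta + 2, sizeof la);
    memcpy(&lb, tb + 2, sizeof lb);
    return (la > lb) - (la < lb);
}

/* build an index of pointers into the tweets array and sort only the index;
   the TSIZE-byte records themselves never move */
static char** sortTweetIndex(char* tweets, size_t count, size_t tsize) {
    char** index = malloc(count * sizeof *index);
    if (index == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)
        index[i] = tweets + i * tsize;
    qsort(index, count, sizeof *index, cmpByLn);
    return index; /* index[0..count-1] now points at the records in sorted order */
}

After sorting, iterating over index[] visits the records in key order while only one pointer per tweet had to be moved around during the sort.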

Upvotes: 2
