Patrick Bauer

Reputation: 21

Sequential, subsequent loading of files gets much slower over time

I've got the following code to read and process multiple very big files one after another.

for(j = 0; j < CORES; ++j) {
    double time = omp_get_wtime();
    printf("File: %d, time: %f\n", j, time);

    char in[256];
    sprintf(in, "%s.%d", FIN, j);

    FILE* f = fopen(in, "r");
    if (f == NULL) {
        fprintf(stderr, "open failed: %s\n", in);
        exit(1);
    }
    int i;
    char buffer[1024];
    char* tweet;
    int takeTime = 1;
    /* each file's TNUM records go into their own slice of the global TWEETS array */
    for (i = 0, tweet = TWEETS + (size_t)j*(size_t)TNUM*(size_t)TSIZE; i < TNUM; i++, tweet += TSIZE) {
        double start;
        double end;
        if(takeTime) {
            start = omp_get_wtime();
            takeTime = 0;
        }
        char* line = fgets(buffer, 1024, f);
        if (line == NULL) {
            fprintf(stderr, "error reading line %d\n", i);
            exit(2);
        }
        int fn = readNumber(&line);
        int ln = readNumber(&line);
        int month = readMonth(&line);
        int day = readNumber(&line);
        int hits = countHits(line, key);
        writeTweet(tweet, fn, ln, hits, month, day, line);

        if(i%1000000 == 0) {
            end = omp_get_wtime();
            printf("Line: %d, Time: %f\n", i, end-start);
            takeTime = 1;
        }
    }
    fclose(f);
}

Every file contains 24,000,000 tweets and I read 8 files in total, one after another. Each line (one tweet) gets processed, and writeTweet() copies a modified line into one really big char array.

As you can see, I measure how long it takes to read and process 1 million tweets. For the first file it's about 0.5 seconds per million, which is fast enough. But with every additional file it takes longer and longer: file 2 takes about 1 second per million lines (though not on every iteration, only some of them), up to 8 seconds on file number 8. Is this to be expected? Can I speed things up? All the files are more or less the same, each with 24 million lines.

Edit: Additional information: every file needs about 730 MB of RAM in processed form, so with 8 files we end up needing about 6 GB of memory in total.

As requested, here is the content of writeTweet():

void writeTweet(char* tweet, const int fn, const int ln, const int hits, const int month, const int day, char* line) {
    /* pack the header fields into the first 9 bytes of the fixed-size record */
    short* ptr1 = (short*) tweet;
    *ptr1 = (short) fn;
    int* ptr2 = (int*) (tweet + 2);
    *ptr2 = ln;
    *(tweet + 6) = (char) hits;
    *(tweet + 7) = (char) month;
    *(tweet + 8) = (char) day;

    int i;
    int n = TSIZE - 9; /* payload size: record size minus the 9-byte header */

    /* pad the line out to the full payload width with spaces */
    for (i = strlen(line); i < n; i++)
        line[i] = ' ';

    memcpy(tweet + 9, line, n);
}

Upvotes: 2

Views: 65

Answers (1)

injecto

Reputation: 879

Probably, writeTweet() is the bottleneck. If you keep every processed tweet in memory, you build up a huge data array that the operating system has to manage over time. If there is not enough free memory, or other processes on the system are actively using it, the OS will in most cases page part of that data out to disk, which increases the time it takes to access the array. There are other mechanisms in the OS, hidden from the user, that can affect performance as well.

You shouldn't store all the processed lines in memory. The simplest approach is to dump the processed tweets to disk (write them to a file). The right solution, however, depends on how you use the processed tweets afterwards. If you don't access the data in the array sequentially, it is worth thinking about a dedicated storage data structure (B-trees?). There are already many libraries for this purpose; better to look for one of them.
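As a rough illustration of the "write to a file" approach (a sketch only, not code from the question or the answer): each processed record could be streamed to an already-opened output FILE* instead of being copied into the big TWEETS array. TSIZE is the record size from the question; the function name writeTweetToFile, the output handle and the error handling are assumptions made for this example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void writeTweetToFile(FILE* out, const int fn, const int ln, const int hits,
                      const int month, const int day, const char* line) {
    char record[TSIZE];                  /* assumes TSIZE is a compile-time constant */
    memset(record, ' ', sizeof record);  /* pad with spaces, like writeTweet() does  */

    /* same 9-byte header layout as writeTweet(), written via memcpy */
    short sfn = (short) fn;
    memcpy(record,     &sfn, sizeof sfn);
    memcpy(record + 2, &ln,  sizeof ln);
    record[6] = (char) hits;
    record[7] = (char) month;
    record[8] = (char) day;

    size_t len = strlen(line);
    if (len > (size_t)(TSIZE - 9))
        len = TSIZE - 9;
    memcpy(record + 9, line, len);

    if (fwrite(record, 1, sizeof record, out) != sizeof record) {
        fprintf(stderr, "write failed\n");
        exit(3);
    }
}

The caller would fopen() one output file (an illustrative name like "%s.out.%d") before the inner loop and fclose() it afterwards; the fixed-size records can later be read back sequentially with fread() in TSIZE-byte chunks.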

UPD:

Modern OSes (including Linux) use a virtual memory model. To support it, the kernel's memory manager maintains special structures that map virtual pages to real pages in physical memory. These are usually multi-level page tables, where large memory volumes are referenced through sub-tables, so the whole structure is quite large and branched. When working with a big chunk of memory, pages are often accessed essentially at random, and to speed up address translation the hardware uses a special cache. I don't know all the subtleties of this process, but I think that in this case the cache has to be invalidated frequently, because there is no room to keep all the mappings cached at the same time. That is an expensive operation and it reduces performance, and the more memory you use, the worse it gets.

If you need to sort the large tweets array, you don't have to keep everything in memory: there are ways to sort data on disk. And if you do want to sort the data in memory, you don't need to actually swap the array elements themselves. It's better to use an intermediate structure with references to the elements of the tweets array, and to sort the references instead of the data.
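To make that last point concrete, here is a minimal sketch of sorting an index of pointers instead of the TSIZE-byte records themselves. It assumes the record layout from writeTweet() above (an int at byte offset 2); the choice of sort key, the function names and the comparator are illustrative only.

#include <stdlib.h>
#include <string.h>

/* compare two records through their pointers; the key here is the int
   stored at offset 2 (the 'ln' field written by writeTweet()) */
static int cmpByLn(const void* a, const void* b) {
    const char* ta = *(const char* const*) a;
    const char* tb = *(const char* const*) b;
    int la, lb;
    memcpy(&la, ta + 2, sizeof la);
    memcpy(&lb, tb + 2, sizeof lb);
    return (la > lb) - (la < lb);
}

/* build an index of pointers into the tweets array and sort only the index;
   the TSIZE-byte records themselves never move */
static char** sortTweetIndex(char* tweets, size_t count, size_t tsize) {
    char** index = malloc(count * sizeof *index);
    if (index == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)
        index[i] = tweets + i * tsize;
    qsort(index, count, sizeof *index, cmpByLn);
    return index; /* index[0..count-1] now points at the records in sorted order */
}

After sorting, iterating over index[] visits the records in key order while only one pointer per tweet had to be moved around during the sort.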

Upvotes: 2
