I'm writing a program that basically searches a directory and all its sub-directories for duplicate files. I have refined both the question and the code according to your suggestions (functions that needed to return a default value have been fixed to do so) so here it goes... Here is the code of the comparing functions: int compare() { int a, b; unsigned char byte1, byte2; while(1) { a = fread(&byte1, 1, 1, file1); b = fread(&byte2, 1, 1, file2); if(a == 0 && b == 0) break; if(a != b) return 1; if(byte2 != byte1) return 1; } return 0; } void startCompare() { char path1[1000], path2[1000]; FILE *reference = fopen("list.comp", "r"); FILE *other = fopen("list2.comp", "r"); int i, flag, j; i = 0; while(fgets(path1, 1000, reference)) { flag = 0; strtok(path1, "\n"); openFile1(path1); for(j = 0; j <= i; ++j) { fgets(path2, 1000, other); } while(fgets(path2, 1000, other)) { strtok(path2, "\n"); openFile2(path2); if(!compare()) { printf("Checking: %s vs. %s --> DUPLICATE\n", path1, path2); flag = 1; break; } else { printf("Checking: %s vs. %s --> DIFFERENT\n", path1, path2); } } if(flag == 1) { printf("Will be deleted.\n"); } } } (The startCompare() function is called first) Now, the directory itself has these files: bloblo bloblo/frofo bloblo/frofo/New Folder bloblo/frofo/New Folder (2) bloblo/frofo/New Folder (2)/New Folder (3) bloblo/frofo/New Folder (2)/New Folder (3)/New Text Document.txt bloblo/frofo/New Folder (2)/New Folder (3)/Untitled4 0.comp 1.comp 2.comp 3.comp 4.comp 5.comp 11.comp 100.comp duplicate_delete.dev duplicate_delete.exe duplicate_delete.layout list.comp list2.comp main.c main.o Makefile.win Untitled5.c Untitled5.exe The output is: Checking: 0.comp vs. 1.comp --> DIFFERENT Checking: 0.comp vs. 100.comp --> DIFFERENT Checking: 0.comp vs. 11.comp --> DIFFERENT Checking: 0.comp vs. 2.comp --> DIFFERENT Checking: 0.comp vs. 3.comp --> DIFFERENT Checking: 0.comp vs. 4.comp --> DIFFERENT Checking: 0.comp vs. 5.comp --> DIFFERENT Checking: 0.comp vs. duplicate_delete.dev --> DIFFERENT Checking: 0.comp vs. duplicate_delete.exe --> DIFFERENT Checking: 0.comp vs. duplicate_delete.layout --> DIFFERENT Checking: 0.comp vs. list.comp --> DIFFERENT Checking: 0.comp vs. list2.comp --> DIFFERENT Checking: 0.comp vs. main.c --> DIFFERENT Checking: 0.comp vs. main.o --> DIFFERENT Checking: 0.comp vs. Makefile.win --> DIFFERENT Checking: 0.comp vs. Untitled5.c --> DIFFERENT Checking: 0.comp vs. Untitled5.exe --> DIFFERENT With return code 0. While what it should be printing is every file being checked with each other and finding that the files 100.comp and 11.comp are a duplicate of each other and the other files unique. So basically, why does it stop there? Why doesn't it continue to check? Is there any way this can be solved?

Reputation: 5

fgets() does not read the whole file

I'm writing a program that basically searches a directory and all its sub-directories for duplicate files. I have refined both the question and the code according to your suggestions (functions that needed to return a default value have been fixed to do so) so here it goes...

Here is the code of the comparing functions:

int compare()
{
    int a, b;
    unsigned char byte1, byte2;

    while(1)
    {
        a = fread(&byte1, 1, 1, file1);
        b = fread(&byte2, 1, 1, file2);
        if(a == 0 && b == 0) break;
        if(a != b) return 1;
        if(byte2 != byte1) return 1;
    }

    return 0;
}

void startCompare()
{
    char path1[1000], path2[1000];
    FILE *reference = fopen("list.comp", "r");
    FILE *other = fopen("list2.comp", "r");
    int i, flag, j;
    i = 0;

    while(fgets(path1, 1000, reference))
    {
        flag = 0;
        strtok(path1, "\n");  
        openFile1(path1);
        for(j = 0; j <= i; ++j)
        {
            fgets(path2, 1000, other);
        }
        while(fgets(path2, 1000, other))
        {
            strtok(path2, "\n");
            openFile2(path2);
            if(!compare())
            {
                printf("Checking: %s vs. %s --> DUPLICATE\n", path1, path2);
                flag = 1;
                break;
            }
            else
            {
                printf("Checking: %s vs. %s --> DIFFERENT\n", path1, path2);
            }
        }
        if(flag == 1)
        {
            printf("Will be deleted.\n");
        }
    }
}

(The startCompare() function is called first)

Now, the directory itself has these files:

bloblo
bloblo/frofo
bloblo/frofo/New Folder
bloblo/frofo/New Folder (2)
bloblo/frofo/New Folder (2)/New Folder (3)
bloblo/frofo/New Folder (2)/New Folder (3)/New Text Document.txt
bloblo/frofo/New Folder (2)/New Folder (3)/Untitled4
0.comp
1.comp
2.comp
3.comp
4.comp
5.comp
11.comp
100.comp
duplicate_delete.dev
duplicate_delete.exe
duplicate_delete.layout
list.comp
list2.comp
main.c
main.o
Makefile.win
Untitled5.c
Untitled5.exe

The output is:

Checking: 0.comp vs. 1.comp --> DIFFERENT
Checking: 0.comp vs. 100.comp --> DIFFERENT
Checking: 0.comp vs. 11.comp --> DIFFERENT
Checking: 0.comp vs. 2.comp --> DIFFERENT
Checking: 0.comp vs. 3.comp --> DIFFERENT
Checking: 0.comp vs. 4.comp --> DIFFERENT
Checking: 0.comp vs. 5.comp --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.dev --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.exe --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.layout --> DIFFERENT
Checking: 0.comp vs. list.comp --> DIFFERENT
Checking: 0.comp vs. list2.comp --> DIFFERENT
Checking: 0.comp vs. main.c --> DIFFERENT
Checking: 0.comp vs. main.o --> DIFFERENT
Checking: 0.comp vs. Makefile.win --> DIFFERENT
Checking: 0.comp vs. Untitled5.c --> DIFFERENT
Checking: 0.comp vs. Untitled5.exe --> DIFFERENT

With return code 0.

While what it should be printing is every file being checked with each other and finding that the files 100.comp and 11.comp are a duplicate of each other and the other files unique. So basically, why does it stop there? Why doesn't it continue to check? Is there any way this can be solved?

Upvotes: 0

Answers (3)

Luis Colorado

Reputation: 12645

Sorry, but if you are searching for duplicate files, you are using an inefficient approach. It's better to run md5sum(1) on them and sort the listing produced. Then, you'll compare one line with the next for equal md5 values. In case you don't trust md5sum(1) (somebody says that different files can give the same checksum), you can just compare the file contents (but only on the already matching md5 checksums). This is, by far, much more efficient than your approach. It can be solved with:

find <dir> -type f -name "glob_pattern" -print0 | xargs -0 md5sum | sort >files.md5sum

then, edit the file files.md5sum and search for /^\([0-9a-f]*\) .*\n\1/ pattern to get the repeated md5's. You'll get even multiple repetitions of the same file.

Note

Take care that empty files have all the same MD5 checksum and all of them compare equal. You'll get this also in your listing.

Upvotes: 0

Criticize SE actions means ban

Reputation: 58848

To answer the question you actually asked (why doesn't it compare all pairs of files):

Your program first reads the first line from list.comp. Then, it reads each line from list2.comp and compares the files.

Then, it reads the next line from list.comp, and tries to compare it with all the files from list2.comp again. But it's already at the end of list2.comp, so it doesn't read any more filenames from list2.comp.

You could use rewind(other); to "rewind" back to the beginning of list2.comp, so you can read the filenames again.

Upvotes: 0

Weather Vane

Reputation: 34575

I don't know whether this will answer your TLDR question and code, but this is too much for a comment.

Your function compare() never returns 0, if you had enabled and taken notice of compiler warnings, you would have known this. The function also uses the dreaded feof(). See why feof() is wrong

I suggest replacing this

int compare()
{
    while(!feof(file1))
    {
        fread(&byte1, sizeof(unsigned char), 1, file1);
        fread(&byte2, sizeof(unsigned char), 1, file2);
        if(byte2 != byte1) return 1;
    }
    if(feof(file1) && (!feof(file2))) return 1;
    if(feof(file2) && (!feof(file1))) return 1;
}

with this, because checking the amount of data read by fread() is the way to test for end of file.

int compare()
// return 0 if files are the same
// *** always include a comment to tell you what the function does / returns ***
{
    size_t read1, read2;
    while(1) {
        read1 = fread(&byte1, 1, 1, file1);
        read2 = fread(&byte2, 1, 1, file2);
        if (read1==0 && read2==0)
            break;             // success: both files ended
        if (read1 != read2)
            return 1;          // bad: one of them read, other didn't
        if (byte2 != byte1)
            return 1;          // bad: files read different data
    }
    return 0;
}

and note that sizeof(unsigned char) is quite unnecessary, it is 1.

I would also have put byte1 and byte2 as local variables.

Upvotes: 1

fgets() does not read the whole file

Answers (3)

Note

Related Questions