Reputation: 5
I'm writing a program that basically searches a directory and all its sub-directories for duplicate files. I have refined both the question and the code according to your suggestions (functions that needed to return a default value have been fixed to do so) so here it goes...
Here is the code of the comparing functions:
int compare()
{
int a, b;
unsigned char byte1, byte2;
while(1)
{
a = fread(&byte1, 1, 1, file1);
b = fread(&byte2, 1, 1, file2);
if(a == 0 && b == 0) break;
if(a != b) return 1;
if(byte2 != byte1) return 1;
}
return 0;
}
void startCompare()
{
char path1[1000], path2[1000];
FILE *reference = fopen("list.comp", "r");
FILE *other = fopen("list2.comp", "r");
int i, flag, j;
i = 0;
while(fgets(path1, 1000, reference))
{
flag = 0;
strtok(path1, "\n");
openFile1(path1);
for(j = 0; j <= i; ++j)
{
fgets(path2, 1000, other);
}
while(fgets(path2, 1000, other))
{
strtok(path2, "\n");
openFile2(path2);
if(!compare())
{
printf("Checking: %s vs. %s --> DUPLICATE\n", path1, path2);
flag = 1;
break;
}
else
{
printf("Checking: %s vs. %s --> DIFFERENT\n", path1, path2);
}
}
if(flag == 1)
{
printf("Will be deleted.\n");
}
}
}
(The startCompare() function is called first)
Now, the directory itself has these files:
The output is:
Checking: 0.comp vs. 1.comp --> DIFFERENT
Checking: 0.comp vs. 100.comp --> DIFFERENT
Checking: 0.comp vs. 11.comp --> DIFFERENT
Checking: 0.comp vs. 2.comp --> DIFFERENT
Checking: 0.comp vs. 3.comp --> DIFFERENT
Checking: 0.comp vs. 4.comp --> DIFFERENT
Checking: 0.comp vs. 5.comp --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.dev --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.exe --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.layout --> DIFFERENT
Checking: 0.comp vs. list.comp --> DIFFERENT
Checking: 0.comp vs. list2.comp --> DIFFERENT
Checking: 0.comp vs. main.c --> DIFFERENT
Checking: 0.comp vs. main.o --> DIFFERENT
Checking: 0.comp vs. Makefile.win --> DIFFERENT
Checking: 0.comp vs. Untitled5.c --> DIFFERENT
Checking: 0.comp vs. Untitled5.exe --> DIFFERENT
With return code 0.
While what it should be printing is every file being checked with each other and finding that the files 100.comp and 11.comp are a duplicate of each other and the other files unique. So basically, why does it stop there? Why doesn't it continue to check? Is there any way this can be solved?
Upvotes: 0
Views: 1078
Reputation: 12645
Sorry, but if you are searching for duplicate files, you are using an inefficient approach. It's better to run md5sum(1) on them and sort the listing produced. Then, you'll compare one line with the next for equal md5 values. In case you don't trust md5sum(1) (somebody says that different files can give the same checksum), you can just compare the file contents (but only on the already matching md5 checksums). This is, by far, much more efficient than your approach. It can be solved with:
find <dir> -type f -name "glob_pattern" -print0 | xargs -0 md5sum | sort >files.md5sum
then, edit the file files.md5sum
and search for /^\([0-9a-f]*\) .*\n\1/
pattern to get the repeated md5's. You'll get even multiple repetitions of the same file.
Take care that empty files have all the same MD5 checksum and all of them compare equal. You'll get this also in your listing.
Upvotes: 0
Reputation: 58848
To answer the question you actually asked (why doesn't it compare all pairs of files):
Your program first reads the first line from list.comp
. Then, it reads each line from list2.comp
and compares the files.
Then, it reads the next line from list.comp
, and tries to compare it with all the files from list2.comp
again. But it's already at the end of list2.comp
, so it doesn't read any more filenames from list2.comp
.
You could use rewind(other);
to "rewind" back to the beginning of list2.comp
, so you can read the filenames again.
Upvotes: 0
Reputation: 34575
I don't know whether this will answer your TLDR question and code, but this is too much for a comment.
Your function compare()
never returns 0
, if you had enabled and taken notice of compiler warnings, you would have known this. The function also uses the dreaded feof()
. See why feof()
is wrong
I suggest replacing this
int compare()
{
while(!feof(file1))
{
fread(&byte1, sizeof(unsigned char), 1, file1);
fread(&byte2, sizeof(unsigned char), 1, file2);
if(byte2 != byte1) return 1;
}
if(feof(file1) && (!feof(file2))) return 1;
if(feof(file2) && (!feof(file1))) return 1;
}
with this, because checking the amount of data read by fread()
is the way to test for end of file.
int compare()
// return 0 if files are the same
// *** always include a comment to tell you what the function does / returns ***
{
size_t read1, read2;
while(1) {
read1 = fread(&byte1, 1, 1, file1);
read2 = fread(&byte2, 1, 1, file2);
if (read1==0 && read2==0)
break; // success: both files ended
if (read1 != read2)
return 1; // bad: one of them read, other didn't
if (byte2 != byte1)
return 1; // bad: files read different data
}
return 0;
}
and note that sizeof(unsigned char)
is quite unnecessary, it is 1.
I would also have put byte1
and byte2
as local variables.
Upvotes: 1