Reputation: 6651
I have a file of binary data with various character strings sprinkled throughout. I am trying to write a C code to find the first occurrence of user-specified strings in the file. (I know this can be done with bash but I need a C code for other reasons.) The code as it stands is:
#include <stdio.h>
#include <string.h>
#define CHUNK_SIZE 512
int main(int argc, char **argv) {
char *fname = argv[1];
char *tag = argv[2];
FILE *infile;
char *chunk;
char *taglcn = NULL;
long lcn_in_file = 0;
int back_step;
fpos_t pos;
// allocate chunk
chunk = (char*)malloc((CHUNK_SIZE + 1) * sizeof(char));
// find back_step
back_step = strlen(tag) - 1;
// open file
infile = fopen(fname, "r");
// loop
while (taglcn == NULL) {
// read chunk
memset(chunk, 0, (CHUNK_SIZE + 1) * sizeof(char));
fread(chunk, sizeof(char), CHUNK_SIZE, infile);
printf("Read %c\n", chunk[0]);
// look for tag
taglcn = strstr(chunk, tag);
if (taglcn != NULL) {
// if you find tag, add to location the offset in bytes from beginning of chunk
lcn_in_file += (long)(taglcn - chunk);
printf("HEY I FOUND IT!\n");
} else {
// if you don't find tag, add chunk size minus back_step to location and ...
lcn_in_file += ((CHUNK_SIZE - back_step) * sizeof(char));
// back file pointer up by back_step for next read
fseek(infile, -back_step, SEEK_CUR);
fgetpos(infile, &pos);
printf("%ld\n", pos);
printf("%s\n\n\n", chunk);
}
}
printf("%ld\n", lcn_in_file);
fclose(infile);
free(chunk);
}
If you're wondering, back_step
is put in to take care of the unlikely eventuality that the string in question is split by a chunk
boundary.
The file I am trying to examine is about 1Gb in size. The problem is that for some reason I can find any string within the first 9000 or so bytes, but beyond that, strstr
is somehow not detecting any string. That is, if I look for a string located beyond 9000 or so bytes into the file, strstr
does not detect it. The code reads through the entire file and never finds the search string.
I have tried varying CHUNK_SIZE
from 128 to 50000, with no change in results. I have tried varying back_step
as well. I have even put in diagnostic code to print out chunk
character by character when strstr
fails to find the string, and sure enough, the string is exactly where it is supposed to be. The diagnostic output of pos
is always correct.
Can anyone tell me where I am going wrong? Is strstr
the wrong tool to use here?
Upvotes: 2
Views: 456
Reputation: 6063
A binary data file is going to contain '\0' bytes acting as string ends. The more that are in there, the shorter the area strstr
is going to search will be. Note strstr
will consider its work done once it hits a 0 byte.
You can scan the memory in intervals like
while (strlen (chunk) < CHUNKSIZE)
chunk += strlen (chunk) + 1;
i.e. restart after a null byte in the chunk as long as you are still within the chunk.
Upvotes: 1
Reputation: 1
For some really simple code, you can use mmap()
and memcmp()
.
Error checking and proper header files are left as an exercise for the reader (there is at least one bug - another exercise for the reader to find):
int main( int argc, char **argv )
{
// put usable names on command-line args
char *fname = argv[ 1 ];
char *tag = argv[ 2 ];
// mmap the entire file
int fd = open( fname, O_RDONLY );
struct stat sb;
fstat( fd, &sb );
char *contents = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
close( fd );
size_t tag_len = strlen( tag );
size_t bytes_to_check = 1UL + sb.st_size - tag_len;
for ( size_t ii = 0; ii < bytes_to_check; ii++ )
{
if ( !memcmp( contents + ii, tag, tag_len ) )
{
// match found
// (probably want to check if contents[ ii + tag_len ]
// is a `\0' char to get actual string matches)
}
}
munmap( contents, sb.st_len );
return( 0 );
}
That likely won't be anywhere near the fastest way (in general, mmap()
is not going to be anywhere near a performance winner, especially in this use case of simply streaming through a file from beginning to end), but it's simple.
(Note that mmap()
also has problems if the file size changes while it's being read. If the file grows, you won't see the additional data. If the file is shortened, you'll get SIGBUS
when you try to read the removed data.)
Upvotes: 1
Reputation: 144665
The most likely reason for strstr
to fail in your code is the presence of null bytes in the file. Furthermore, you should open the file in binary mode for the file offsets to be meaningful.
To scan for a sequence of bytes in a block, use the memmem()
function. If it is not available on your system, here is a simple implementation:
#include <string.h>
void *memmem(const void *haystack, size_t n1, const void *needle, size_t n2) {
const unsigned char *p1 = haystack;
const unsigned char *p2 = needle;
if (n2 == 0)
return (void*)p1;
if (n2 > n1)
return NULL;
const unsigned char *p3 = p1 + n1 - n2 + 1;
for (const unsigned char *p = p1; (p = memchr(p, *p2, p3 - p)) != NULL; p++) {
if (!memcmp(p, p2, n2))
return (void*)p;
}
return NULL;
}
You would modify your program this way:
#include <errno.h>
#include <stdio.h>
#include <string.h>
void *memmem(const void *haystack, size_t n1, const void *needle, size_t n2);
#define CHUNK_SIZE 65536
int main(int argc, char **argv) {
if (argc < 3) {
fprintf(sderr, "missing parameters\n");
exit(1);
}
// open file
char *fname = argv[1];
FILE *infile = fopen(fname, "rb");
if (infile == NULL) {
fprintf(sderr, "cannot open file %s: %s\n", fname, strerror(errno));
exit(1);
}
char *tag = argv[2];
size_t tag_len = strlen(tag);
size_t overlap_len = 0;
long long pos = 0;
char *chunk = malloc(CHUNK_SIZE + tag_len - 1);
if (chunk == NULL) {
fprintf(sderr, "cannot allocate memory\n");
exit(1);
}
// loop
for (;;) {
// read chunk
size_t chunk_len = overlap_len + fread(chunk + overlap_len, 1,
CHUNK_SIZE, infile);
if (chunk_len < tag_len) {
// end of file or very short file
break;
}
// look for tag
char *tag_location = memmem(chunk, chunk_len, tag, tag_len);
if (tag_location != NULL) {
// if you find tag, add to location the offset in bytes from beginning of chunk
printf("string found at %lld\n", pos + (tag_location - chunk));
break;
} else {
// if you don't find tag, add chunk size minus back_step to location and ...
overlap_len = tag_len - 1;
memmove(chunk, chunk + chunk_len - overlap_len, overlap_len);
pos += chunk_len - overlap_len;
}
}
fclose(infile);
free(chunk);
return 0;
}
Note that the file is read in chunks of CHUNK_SIZE
bytes, which is optimal if CHUNK_SIZE
is a multiple of the file system block size.
Upvotes: 4
Reputation: 753505
Since you say your file is binary, strstr()
will stop scanning at the first null byte in the file.
If you wish to look for patterns in binary data, then the memmem()
function is appropriate, if it is available. It is available on Linux and some other platforms (BSD, macOS, …) but it is not defined as part of standard C or POSIX. It bears roughly the same relation to strstr()
that memcpy()
bears to strcpy()
.
Note that your code should detect the number of bytes read by fread()
and only search on that.
char *tag = …; // Identify the data to be searched for
size_t taglen = …; // Identify the data's length (maybe strlen(tag))
int nbytes;
while ((nbytes = fread(chunk, 1, (CHUNK_SIZE + 1), infile)) > 0)
{
…
tagcln = memmem(chunk, nbytes, tag, taglen);
if (tagcln != 0)
…found it…
…
}
It isn't really clear why you have the +1
on the chunk size. The fread()
function doesn't add null bytes at the end of the data or anything like that. I've left that aspect unchanged, but would probably not use it in my own code.
It is good that you take care of identifying a tag that spans the boundaries between two chunks.
Upvotes: 5