bob.sacamento
bob.sacamento

Reputation: 6651

Subtlety in strstr?

I have a file of binary data with various character strings sprinkled throughout. I am trying to write a C code to find the first occurrence of user-specified strings in the file. (I know this can be done with bash but I need a C code for other reasons.) The code as it stands is:

#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 512

int main(int argc, char **argv) {
    char *fname = argv[1];
    char *tag = argv[2];
    FILE *infile;
    char *chunk;
    char *taglcn = NULL;
    long lcn_in_file = 0;
    int back_step;
    fpos_t pos;

    // allocate chunk
    chunk = (char*)malloc((CHUNK_SIZE + 1) * sizeof(char));

    // find back_step
    back_step = strlen(tag) - 1;

    // open file
    infile = fopen(fname, "r");

    // loop
    while (taglcn == NULL) { 
        // read chunk
        memset(chunk, 0, (CHUNK_SIZE + 1) * sizeof(char));
        fread(chunk, sizeof(char), CHUNK_SIZE, infile);
        printf("Read %c\n", chunk[0]);
        // look for tag
        taglcn = strstr(chunk, tag);
        if (taglcn != NULL) {
            // if you find tag, add to location the offset in bytes from beginning of chunk
            lcn_in_file += (long)(taglcn - chunk);
            printf("HEY I FOUND IT!\n");
        } else {
            // if you don't find tag, add chunk size minus back_step to location and ...
            lcn_in_file += ((CHUNK_SIZE - back_step) * sizeof(char)); 
            // back file pointer up by back_step for next read 
            fseek(infile, -back_step, SEEK_CUR);
            fgetpos(infile, &pos);
            printf("%ld\n", pos);
            printf("%s\n\n\n", chunk);
        }
    }
    printf("%ld\n", lcn_in_file);

    fclose(infile);
    free(chunk);
}

If you're wondering, back_step is put in to take care of the unlikely eventuality that the string in question is split by a chunk boundary.

The file I am trying to examine is about 1Gb in size. The problem is that for some reason I can find any string within the first 9000 or so bytes, but beyond that, strstr is somehow not detecting any string. That is, if I look for a string located beyond 9000 or so bytes into the file, strstr does not detect it. The code reads through the entire file and never finds the search string.

I have tried varying CHUNK_SIZE from 128 to 50000, with no change in results. I have tried varying back_step as well. I have even put in diagnostic code to print out chunk character by character when strstr fails to find the string, and sure enough, the string is exactly where it is supposed to be. The diagnostic output of pos is always correct.

Can anyone tell me where I am going wrong? Is strstr the wrong tool to use here?

Upvotes: 2

Views: 456

Answers (4)

tofro
tofro

Reputation: 6063

A binary data file is going to contain '\0' bytes acting as string ends. The more that are in there, the shorter the area strstr is going to search will be. Note strstr will consider its work done once it hits a 0 byte.

You can scan the memory in intervals like

while (strlen (chunk) < CHUNKSIZE) 
   chunk += strlen (chunk) + 1;

i.e. restart after a null byte in the chunk as long as you are still within the chunk.

Upvotes: 1

Andrew Henle
Andrew Henle

Reputation: 1

For some really simple code, you can use mmap() and memcmp().

Error checking and proper header files are left as an exercise for the reader (there is at least one bug - another exercise for the reader to find):

int main( int argc, char **argv )
{
    // put usable names on command-line args
    char *fname = argv[ 1 ];
    char *tag = argv[ 2 ];

    // mmap the entire file
    int fd = open( fname, O_RDONLY );
    struct stat sb;
    fstat( fd, &sb );
    char *contents = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd );

    size_t tag_len = strlen( tag );

    size_t bytes_to_check = 1UL + sb.st_size - tag_len;

    for ( size_t ii = 0; ii < bytes_to_check; ii++ )
    {
        if ( !memcmp( contents + ii, tag, tag_len ) )
        {
             // match found
             // (probably want to check if contents[ ii + tag_len ]
             //  is a `\0' char to get actual string matches)
        }
    }

    munmap( contents, sb.st_len );

    return( 0 );
}

That likely won't be anywhere near the fastest way (in general, mmap() is not going to be anywhere near a performance winner, especially in this use case of simply streaming through a file from beginning to end), but it's simple.

(Note that mmap() also has problems if the file size changes while it's being read. If the file grows, you won't see the additional data. If the file is shortened, you'll get SIGBUS when you try to read the removed data.)

Upvotes: 1

chqrlie
chqrlie

Reputation: 144665

The most likely reason for strstr to fail in your code is the presence of null bytes in the file. Furthermore, you should open the file in binary mode for the file offsets to be meaningful.

To scan for a sequence of bytes in a block, use the memmem() function. If it is not available on your system, here is a simple implementation:

#include <string.h>

void *memmem(const void *haystack, size_t n1, const void *needle, size_t n2) {
    const unsigned char *p1 = haystack;
    const unsigned char *p2 = needle;

    if (n2 == 0)
        return (void*)p1;
    if (n2 > n1)
        return NULL;

    const unsigned char *p3 = p1 + n1 - n2 + 1;
    for (const unsigned char *p = p1; (p = memchr(p, *p2, p3 - p)) != NULL; p++) {
        if (!memcmp(p, p2, n2))
            return (void*)p;
    }
    return NULL;
}

You would modify your program this way:

#include <errno.h>
#include <stdio.h>
#include <string.h>

void *memmem(const void *haystack, size_t n1, const void *needle, size_t n2);

#define CHUNK_SIZE 65536

int main(int argc, char **argv) {

    if (argc < 3) {
        fprintf(sderr, "missing parameters\n");
        exit(1);
    }

    // open file
    char *fname = argv[1];
    FILE *infile = fopen(fname, "rb");
    if (infile == NULL) {
        fprintf(sderr, "cannot open file %s: %s\n", fname, strerror(errno));
        exit(1);
    }

    char *tag = argv[2];
    size_t tag_len = strlen(tag);
    size_t overlap_len = 0;
    long long pos = 0;

    char *chunk = malloc(CHUNK_SIZE + tag_len - 1);
    if (chunk == NULL) {
        fprintf(sderr, "cannot allocate memory\n");
        exit(1);
    }

    // loop
    for (;;) { 
        // read chunk
        size_t chunk_len = overlap_len + fread(chunk + overlap_len, 1, 
                                               CHUNK_SIZE, infile);
        if (chunk_len < tag_len) {
            // end of file or very short file
            break;
        }
        // look for tag
        char *tag_location = memmem(chunk, chunk_len, tag, tag_len);
        if (tag_location != NULL) {
            // if you find tag, add to location the offset in bytes from beginning of chunk
            printf("string found at %lld\n", pos + (tag_location - chunk));
            break;
        } else {
            // if you don't find tag, add chunk size minus back_step to location and ...
            overlap_len = tag_len - 1;
            memmove(chunk, chunk + chunk_len - overlap_len, overlap_len);
            pos += chunk_len - overlap_len;
        }
    }

    fclose(infile);
    free(chunk);
    return 0;
}

Note that the file is read in chunks of CHUNK_SIZE bytes, which is optimal if CHUNK_SIZE is a multiple of the file system block size.

Upvotes: 4

Jonathan Leffler
Jonathan Leffler

Reputation: 753505

Since you say your file is binary, strstr() will stop scanning at the first null byte in the file.

If you wish to look for patterns in binary data, then the memmem() function is appropriate, if it is available. It is available on Linux and some other platforms (BSD, macOS, …) but it is not defined as part of standard C or POSIX. It bears roughly the same relation to strstr() that memcpy() bears to strcpy().


Note that your code should detect the number of bytes read by fread() and only search on that.

char   *tag = …;     // Identify the data to be searched for
size_t  taglen = …;  // Identify the data's length (maybe strlen(tag))
int     nbytes;
while ((nbytes = fread(chunk, 1, (CHUNK_SIZE + 1), infile)) > 0)
{
    …
    tagcln = memmem(chunk, nbytes, tag, taglen);
    if (tagcln != 0)
        …found it…
    …
}

It isn't really clear why you have the +1 on the chunk size. The fread() function doesn't add null bytes at the end of the data or anything like that. I've left that aspect unchanged, but would probably not use it in my own code.

It is good that you take care of identifying a tag that spans the boundaries between two chunks.

Upvotes: 5

Related Questions