Reputation: 271
In a C program, I need to search for an exact string in a regular file (I use Linux). How should I go about searching?
My first idea was to read each line of the file into RAM (via fgets()) and, after each read, check whether that line is the target string. If it isn't, a loop re-calls fgets() and checks again, until EOF.
But what happens with a file of 150 million lines? This kind of sequential search seems completely inefficient.
I also considered a kind of binary search, using insertion sort to keep sorted the lines my program adds to the file (it adds roughly one line every 3 seconds, immediately after checking that the line doesn't already appear in the file). But then I gave up, because I would first need to read the lines into RAM anyway, taking as much time as a sequential search would. So I chose the sequential search.
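In code, my current sequential approach looks roughly like this (a sketch; the file path and the line-length limit are placeholders I picked for illustration):

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

/* Sequentially scan a file, line by line, for an exact line match. */
bool file_contains_line(const char *path, const char *target)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return false;

    char line[4096];                      /* assumed maximum line length */
    bool found = false;
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\n")] = '\0'; /* strip the trailing newline */
        if (strcmp(line, target) == 0) {
            found = true;
            break;
        }
    }
    fclose(fp);
    return found;
}
```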
Is this assumption right? Or is there a better way? I really hope so.
Upvotes: 0
Views: 237
Reputation: 8548
Could you provide more information? What do you mean by a normal file? What is the maximum file size you need to process?
If it is a big file and you need to perform fast searches, the prototype of the algorithm is: build an index over the file, then search against the index.
The search algorithm will depend on the index type you use (it may be AVL trees, binary trees, etc.).
For more information, read about indexing, searching an index, and existing open-source search systems, for example Apache Lucene and Sphinx.
UPD: these links may be helpful: Implementing an index for text file content
https://superuser.com/questions/233665/efficient-way-to-index-text-files
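As a minimal illustration of the indexing idea (assuming the lines can be loaded into memory once and then queried many times; the function names are my own), you can sort the line pointers with qsort() and answer each lookup in O(log n) with bsearch():

```c
#include <stdlib.h>
#include <string.h>

/* Comparator over an array of C strings, usable by qsort() and bsearch(). */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the line pointers once; afterwards each lookup costs O(log n). */
void index_build(char **lines, size_t n)
{
    qsort(lines, n, sizeof *lines, cmp_str);
}

int index_contains(char **lines, size_t n, const char *target)
{
    return bsearch(&target, lines, n, sizeof *lines, cmp_str) != NULL;
}
```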
Upvotes: 0
Reputation: 527
When matching strings over large data, one trick to speed up the process is to create shortcut filters. This technique is actually used by the Firefox extension AdBlock Plus to check incoming page requests against a large list of filters; it then blocks the requests that match (the ads). But I digress.
When you match a string of, say, n characters, it algorithmically takes O(n) time even in the best case. The trick to reduce the matching time is to reduce n. The concept of shortcut filters is that you create short patterns which are contained in the string in question. That way, you spend less time checking each candidate string; only if the filter matches a string do you check the complete string.
Basically, let's say I have 3 strings:
1) ABCDABCD 2) DCBABCDA 3) ABDEFGHI
Suppose I have the pattern "ABC". While iterating, the 1st and 2nd strings return a match; the third string is rejected. You then check only the strings that matched the pattern against the full target string. On the other hand, the pattern "EFG" rejects 1 and 2 (in less time) and matches 3.
The matching of the substring may be improved further by using a hash table. Here, you fix the size of the substring, say 3 as above. Then, for each string, you compute hashes of all its substrings of length 3. That way, you can quickly determine (in O(1) time) which of the patterns match the string.
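A toy sketch of the filter idea (the helper name is hypothetical; strstr() stands in for the cheap pattern check, and the full comparison runs only when the filter matches):

```c
#include <string.h>
#include <stdbool.h>

/* Cheap pre-filter: does `s` contain the short pattern at all?
   Only if the filter matches do we pay for the full string compare. */
bool matches_with_filter(const char *s, const char *pattern,
                         const char *full_target)
{
    if (strstr(s, pattern) == NULL)
        return false;                    /* rejected cheaply */
    return strcmp(s, full_target) == 0;  /* full check only now */
}
```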
Upvotes: 0
Reputation: 55583
You could use mmap
to map the entire file into memory and then search it. Note that strnstr
is a BSD extension that glibc doesn't provide; on Linux you can use memmem (with _GNU_SOURCE) instead:
#define _GNU_SOURCE   /* for memmem() on glibc */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main(void)
{
const char *fileName = "test.txt"; // note: fopen() does not expand '~'
long fileLength;
// open the file and get its length
FILE *fptr = fopen(fileName, "r");
if (fptr == NULL)
{
perror("could not open file");
return 1;
}
fseek(fptr, 0, SEEK_END);
fileLength = ftell(fptr);
// return to the start of the file
fseek(fptr, 0, SEEK_SET);
// map the file
char *fileData = mmap(NULL, fileLength, PROT_READ, MAP_FILE | MAP_SHARED, fileno(fptr), 0);
if (fileData == MAP_FAILED)
{
perror("could not map file");
fclose(fptr);
return 1;
}
// scan the file
char stringToSearchFor[] = "this is my string!";
if (memmem(fileData, fileLength, stringToSearchFor, strlen(stringToSearchFor)) != NULL)
{
printf("file contains the string!\n");
}
else {
printf("file doesn't contain the string\n");
}
// clean up
munmap(fileData, fileLength);
fclose(fptr);
return 0;
}
Upvotes: 2