Reputation: 271
In a C program, I need to search for an exact string in a regular file (I use Linux). How should I go about searching?
My first idea was to read each line of the file into RAM (via fgets()) and, after each read, check whether that line is the target string. If it isn't, a loop re-calls fgets() and checks again, until EOF.
But what happens with a file of 150 million lines? This kind of sequential search seems completely inefficient.
I also considered a kind of binary search, using insertion sort to keep sorted the lines my program adds to the file (it adds roughly one line every 3 seconds, immediately after checking that the line doesn't already appear in the file). But then I gave up, because I would first need to read the lines into RAM anyway, taking as much time as a sequential search would. So I chose the sequential search.
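In code, my current sequential approach looks roughly like this (a sketch; the file path and the line-length limit are placeholders I picked for illustration):

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

/* Sequentially scan a file, line by line, for an exact line match. */
bool file_contains_line(const char *path, const char *target)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return false;

    char line[4096];                      /* assumed maximum line length */
    bool found = false;
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\n")] = '\0'; /* strip the trailing newline */
        if (strcmp(line, target) == 0) {
            found = true;
            break;
        }
    }
    fclose(fp);
    return found;
}
```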
Is this assumption right? Or is there a better way? I really hope so.
Upvotes: 0
Views: 237
Reputation: 8548
Could you provide more information? What do you mean by a normal file? What is the maximum file size you need to process?
If it is a big file and you need to perform fast searches, the prototype of the algorithm is: build an index over the file, then search against the index.
The search algorithm will depend on the index type you use (it may be AVL trees, binary trees, etc.).
For more information, read about indexing, searching an index, and existing open-source search systems, for example Apache Lucene and Sphinx.
UPD: these links may be helpful: Implementing an index for text file content
https://superuser.com/questions/233665/efficient-way-to-index-text-files
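As a minimal illustration of the indexing idea (assuming the lines can be loaded into memory once and then queried many times; the function names are my own), you can sort the line pointers with qsort() and answer each lookup in O(log n) with bsearch():

```c
#include <stdlib.h>
#include <string.h>

/* Comparator over an array of C strings, usable by qsort() and bsearch(). */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the line pointers once; afterwards each lookup costs O(log n). */
void index_build(char **lines, size_t n)
{
    qsort(lines, n, sizeof *lines, cmp_str);
}

int index_contains(char **lines, size_t n, const char *target)
{
    return bsearch(&target, lines, n, sizeof *lines, cmp_str) != NULL;
}
```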
Upvotes: 0
Reputation: 527
When matching strings over large data, one trick to speed up the process is to create shortcut filters. This technique is actually used by the Firefox extension AdBlock Plus to check incoming page requests against a large list of filters; it then blocks the requests that match (the ads). But I digress.
When you match a string of, say, n characters, it algorithmically takes O(n) time even in the best case. The trick to reduce the matching time is to reduce n. The concept of shortcut filters is that you create short patterns which are contained in the string in question. That way, you spend less time checking each candidate string; only if the filter matches a string do you check the complete string.
Basically, let's say I have 3 strings:
1) ABCDABCD 2) DCBABCDA 3) ABDEFGHI
Suppose I have the pattern "ABC". While iterating, the 1st and 2nd strings return a match; the third string is rejected. You then check only the strings that matched the pattern against the full target string. On the other hand, the pattern "EFG" rejects 1 and 2 (in less time) and matches 3.
The matching of the substring may be improved further by using a hash table. Here, you fix the size of the substring, say 3 as above. Then, for each string, you compute hashes of all its substrings of length 3. That way, you can quickly determine (in O(1) time) which of the patterns match the string.
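A toy sketch of the filter idea (the helper name is hypothetical; strstr() stands in for the cheap pattern check, and the full comparison runs only when the filter matches):

```c
#include <string.h>
#include <stdbool.h>

/* Cheap pre-filter: does `s` contain the short pattern at all?
   Only if the filter matches do we pay for the full string compare. */
bool matches_with_filter(const char *s, const char *pattern,
                         const char *full_target)
{
    if (strstr(s, pattern) == NULL)
        return false;                    /* rejected cheaply */
    return strcmp(s, full_target) == 0;  /* full check only now */
}
```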
Upvotes: 0
Reputation: 55583
You could use mmap
to map the entire file into memory and then search it. Note that strnstr
is a BSD extension that glibc doesn't provide; on Linux you can use memmem (with _GNU_SOURCE) instead:
#define _GNU_SOURCE   /* for memmem() on glibc */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main(void)
{
const char *fileName = "test.txt"; // note: fopen() does not expand '~'
long fileLength;
// open the file and get its length
FILE *fptr = fopen(fileName, "r");
if (fptr == NULL)
{
perror("could not open file");
return 1;
}
fseek(fptr, 0, SEEK_END);
fileLength = ftell(fptr);
// return to the start of the file
fseek(fptr, 0, SEEK_SET);
// map the file
char *fileData = mmap(NULL, fileLength, PROT_READ, MAP_FILE | MAP_SHARED, fileno(fptr), 0);
if (fileData == MAP_FAILED)
{
perror("could not map file");
fclose(fptr);
return 1;
}
// scan the file
char stringToSearchFor[] = "this is my string!";
if (memmem(fileData, fileLength, stringToSearchFor, strlen(stringToSearchFor)) != NULL)
{
printf("file contains the string!\n");
}
else {
printf("file doesn't contain the string\n");
}
// clean up
munmap(fileData, fileLength);
fclose(fptr);
return 0;
}
Upvotes: 2