Reputation: 35
I'm currently trying to find the offset of a string in large files. I know that the string has only one occurrence, but the position in the file can vary.
My first idea was to read the file (which can be a few hundred megabytes easily) into memory first, to speed up the searching.
However, I suspect this will give me the offset in memory, not the real file offset.
How would I get the file offset? Can I somehow map the memory offset to the file offset? Or is there an efficient way of doing this directly on the file?
Some code for reference:
char *buffer;
long fsize = 0;
FILE *fd = fopen("data.bin", "rb"); // binary mode, so bytes (and offsets) are not translated
if (fd == NULL)
{
printf("file I/O error.\n");
return 0;
}
fseek(fd, 0, SEEK_END);
fsize = ftell(fd);
fseek (fd, 0, SEEK_SET);
buffer = malloc(fsize);
if (buffer == NULL)
{
printf("error allocating memory.\n");
return 0;
}
fread(buffer, fsize, 1, fd);
fclose(fd);
// FIND STRING "MAGIC" and return FILE offset
How do I proceed from here? As stated above, performance is an important aspect.
Upvotes: 1
Views: 1284
Reputation: 18420
The easiest, most efficient and most resource-saving way is not to read the file into a buffer at all, but to memory-map it and search the mapping for the string, like this:
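int fd = open(filename, O_RDONLY);              // returns -1 on error
off_t length = lseek(fd, 0, SEEK_END);          // file size in bytes
// Map as char * so that pointer subtraction below is standard C.
char *data = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);  // MAP_FAILED on error
char *ptr = memmem(data, length, key, keylen);  // NULL if the key is not found
// The mapping starts at file offset 0, so the distance from data is the file offset.
off_t offset = (ptr != NULL) ? (off_t)(ptr - data) : -1;
munmap(data, length);
close(fd);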
The big advantage is that you don't have to manage memory for reading the file; the OS does it all for you (including caching, read-ahead, etc.). If the system is low on memory, the OS will automatically discard in-memory pages of the file.
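For reference, here is a slightly fuller sketch of the same idea with error checks added. The function name `find_offset_mmap` and the convention of returning -1 on "not found or error" are my own choices for illustration, not anything standard; it also assumes a glibc-style system where `memmem` is exposed via `_GNU_SOURCE`.

```c
#define _GNU_SOURCE            /* memmem is a GNU extension in glibc */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the file offset of the first match,
 * or -1 if the key is absent or an error occurs. */
off_t find_offset_mmap(const char *path, const void *key, size_t keylen)
{
    off_t offset = -1;

    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        perror("open");
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) == -1) {
        perror("fstat");
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {     /* mmap rejects a zero-length mapping */
        close(fd);
        return -1;
    }

    size_t length = (size_t)st.st_size;
    char *data = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return -1;
    }

    /* The mapping starts at file offset 0, so the distance from the
     * start of the mapping is the offset in the file. */
    const char *hit = memmem(data, length, key, keylen);
    if (hit != NULL)
        offset = (off_t)(hit - data);

    munmap(data, length);
    close(fd);
    return offset;
}
```

A call such as `find_offset_mmap("data.bin", "MAGIC", 5)` would then give the position of the string in the file, or -1.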
Upvotes: 2
Reputation: 32586
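Use memmem to search in the buffer. String functions such as strstr or strchr will not work, because they stop at the first null character, and both the file contents and the string to find may contain null characters. Note that memmem is a GNU extension, not standard C.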
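"However this will most likely result in getting the offset in memory, not the real file offset."
This is false. Because the whole file is read into the buffer starting at byte 0, the offset of the match relative to the start of the buffer is the same as its offset in the file.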
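To make that concrete, here is a minimal sketch that continues from the question's fread approach. The names `find_bytes` and `find_offset_fread` are made up for this example, and `find_bytes` is a simple portable stand-in for memmem (built on memchr and memcmp) so the code compiles where memmem is not available; with glibc you could call memmem instead.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Portable stand-in for memmem: finds key in hay, null bytes allowed.
 * Returns a pointer to the first match, or NULL if there is none. */
static const char *find_bytes(const char *hay, size_t haylen,
                              const char *key, size_t keylen)
{
    if (keylen == 0 || keylen > haylen)
        return NULL;
    const char *end = hay + (haylen - keylen);   /* last possible start */
    for (const char *p = hay; p <= end; p++) {
        /* Jump to the next candidate with the right first byte. */
        p = memchr(p, key[0], (size_t)(end - p) + 1);
        if (p == NULL)
            return NULL;
        if (memcmp(p, key, keylen) == 0)
            return p;
    }
    return NULL;
}

/* Hypothetical helper: reads the whole file and returns the offset of
 * the first match, or -1 if not found or on error. */
long find_offset_fread(const char *path, const char *key, size_t keylen)
{
    long offset = -1;
    char *buffer = NULL;

    FILE *fp = fopen(path, "rb");                /* binary mode */
    if (fp == NULL)
        return -1;

    if (fseek(fp, 0, SEEK_END) == 0) {
        long fsize = ftell(fp);
        if (fsize > 0 && fseek(fp, 0, SEEK_SET) == 0 &&
            (buffer = malloc((size_t)fsize)) != NULL) {
            size_t got = fread(buffer, 1, (size_t)fsize, fp);
            const char *hit = find_bytes(buffer, got, key, keylen);
            if (hit != NULL)
                offset = (long)(hit - buffer);   /* buffer[0] is file byte 0 */
        }
    }

    free(buffer);
    fclose(fp);
    return offset;
}
```

The subtraction `hit - buffer` is the whole trick: nothing needs to be mapped back, since the buffer is a byte-for-byte copy of the file.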
Upvotes: 1