Reputation: 590
I'm working on creating a binary search algorithm in C that searches for a string in a .txt file. Each line is a string representing a stock ticker. Not being familiar with C, this is taking far too long. I have a few questions:
1.) Once I have opened a file using fopen, does it make more sense in terms of efficiency for the algorithm to step through the file using some function provided in the C library for scanning files, doing the compare directly from the file, or should I copy each line into an array and have the algorithm search the array?
2.) If I should compare directly from the file, what is the best way to step through it? Assume I have the number of lines in the file, is there some way to go directly to the middle line, scan the string and do the compare?
I'm sorry if this is too vague. Not too sure how to better explain. Thanks for your time
Upvotes: 4
Views: 2558
Reputation: 372814
This is a great question!
The challenge of binary search is that the benefits of binary search come from being able to skip past half the elements at each step in O(1). This guarantees that, since you only do O(lg n) probes, that the runtime is O(lg n). This is why, for example, you can do a fast binary search on an array but not a linked list - in the linked list, finding the halfway point of the elements takes linear time, which dominates the time for the search.
When doing binary search on a file you are in a similar position. Since all the lines in the file might not have the same length, you can't easily jump to the nth line in the file given some number n. Consequently, implementing a good, fast binary search on a file will be a bit tricky. Somehow, you will need to know where each line starts and stops so that you can efficiently jump around in the file.
There are many ways you can do this. First, you could load all the strings from the file into an array, as you've suggested. This takes linear time, but once you have the array of strings in memory all future binary searches will be very fast. The catch is that if you have a very large file, this may take up a lot of memory, and could be prohibitively expansive. Consequently, another alternative might be not to store the actual stings in the array, but rather the offsets into the file at which each string occurs. This would let you do the binary search quickly - you could seek the file to the proper offset when doing a comparison - and for large stings can be much more space-efficient than the above. And, if all the strings are roughly the same length, you could just pad every line to some fixed size to allow for direct computation of the start position of each line.
If you're willing to expend some time implementing more complex solutions, you might want to consider preprocessing the file so that instead of having one string per line, instead you have at the top of the file a list of fixed-width integers containing the offsets of each string in the file. This essentially does the above work, but then stores the result back in the file to make future binary searches much faster. I have some experience with this sort of file structure, and it can be quite fast.
If you're REALLY up for a challenge, you could alternatively store the strings in the file using a B-tree, which would give you incredibly fast lookup times fir each string by minimizing the number of disk reads that you need to do.
Hope this helps!
Upvotes: 1
Reputation:
I don't see how you can do compare directly from the file. You will have to have a buffer to store data read from disk and use that buffer. So it doesn't make sense, it is just impossible.
You cannot jump to a particular line in the file. Not unless you know the offset in bytes of the beginning of that line relative to the beginning of the file.
I'd recommend using mmap
to map this file directly into memory and work with it as with character array. Operating system will make work with file (like seeking, reading, writing) transparent to you, and you will just work with it like with a buffer in memory. Note that mmap
is limited to 4 GB on 32-bit systems. But if that file is bigger, you probably need to ask the question - why on earth someone has this big file not in an indexed database.
Upvotes: 0
Reputation:
Reading it all in with a bulk read and walking through it with pointers (to memory) is very fast. Avoid doing multiple I/O calls if you can.
I should also mention that memory mapped files can be very suitable for something like this. See mmap() if on Unix. This is definitely your best bet for really large files.
Upvotes: 1
Reputation: 6658
You cannot binary search lines of a text-file without knowing the length of each line in advance, so you'll most likely want to read each line into memory at first (unless the file is very big).
But if your goal is only to search for a single given line as quickly as possible, you might as well just do linear search directly on the file. There's no point in getting O(log n) at the cost of a O(n) setup cost if the search is only done once.
Upvotes: 2
Reputation: 53536
Unless your file is exceedingly big (> 2GB) then loading the file in memory prior searching it is the way to go. In case you cannot load the file in memory, you could hold the offset of each line in an int[]
or (if the file contains too many lines...) create another binary file and write the offset of each lines as integers...
Having everything in memory is by far preferable, though.
Upvotes: 2