Reputation: 21
I'm trying to implement a simplified Boyer-Moore string search algorithm that reads its input text from a file. The algorithm requires that I start at a given file position and read its characters backwards, periodically jumping forward a precomputed number of characters. The jumps are computed based on the pattern's length and indices, so I was storing them as type size_t. What function should I use to read file characters at specific positions, and what type should I use to store these positions? I'm new to C, but these are the options I've considered:
I could use fseek and getc to jump around the file, but this uses a long int as its character index. I don't know if it's safe to cast between this and a size_t, and regardless, the GNU C manual recommends against fseeking text streams for portability reasons.
This is supposed to be more portable, but I don't think I can use this to jump forward or backward an arbitrary number of characters.
I could get around the fseek compatibility issue by opening the file as a binary stream. But I don't know if this could cause other compatibility issues when dealing with pattern/input text, and anyways, this doesn't solve the issue of casting between long int and size_t.
I could use file descriptors instead of streams. But then I need to cast between size_t and off_t, and I don't know how safe that is. I would also give up FILE's buffering, which I'm not sure is advisable. I could try to roll my own buffering, or maybe use an alternate library, but this seems like a massive pain.
My first implementation passed the input text as a command line argument, so it didn't deal with file IO at all. But I don't think this would scale well for large text inputs, and the more I've read about file IO the more stuck I feel. What do you suggest?
Upvotes: 2
Views: 79
Reputation: 386551
size_t ⇔ long conversions
If long is large enough for a file offset, and if your size_t value represents a file offset, then there's no problem with converting between these two. (And no need for an explicit cast.)
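If you would rather not rely on the value fitting, a range check before converting is cheap. The following is a minimal sketch, not part of any library: the helper names pos_to_long and long_to_pos are hypothetical, and it assumes positions are plain byte offsets of the kind passed to fseek or returned by ftell.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert a size_t position to a long for fseek.
   Returns false, leaving *out untouched, if the value does not fit. */
bool pos_to_long(size_t pos, long *out)
{
    assert(out != NULL);
    if (pos > (unsigned long)LONG_MAX) {
        return false;               /* too large for a long offset */
    }
    *out = (long)pos;
    return true;
}

/* Hypothetical helper: convert a long (e.g. from ftell) to a size_t.
   Returns false for negative values, such as ftell's -1L error result. */
bool long_to_pos(long off, size_t *out)
{
    assert(out != NULL);
    if (off < 0 || (unsigned long)off > SIZE_MAX) {
        return false;
    }
    *out = (size_t)off;
    return true;
}
```

A caller would check the return value before seeking, for example `if (pos_to_long(pos, &off)) fseek(fp, off, SEEK_SET);`.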
Portability
So is long actually large enough for a file offset? On Windows, long is well known to be only 32 bits, its minimum permitted size, even in 64-bit programs. So there could be portability issues if you plan on handling files with a size of 2 GiB or greater while using the fseek interface. You should have no problems with smaller files.
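To make the 2 GiB figure concrete: a 32-bit long has LONG_MAX equal to 2^31 - 1 = 2147483647, so that is the largest offset fseek can be handed on such a platform. A minimal sketch of the check, where offset_fits_in_long is a hypothetical name chosen for illustration:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: true if the byte offset can be represented as a
   long, and so passed to fseek, on the platform this is compiled for. */
bool offset_fits_in_long(unsigned long long offset)
{
    return offset <= (unsigned long long)LONG_MAX;
}
```

With a 32-bit long this returns false for 2147483648 (exactly 2 GiB); with a 64-bit long it returns true.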
Jumping forward or backward an arbitrary number of characters
The CRLF line endings used in Windows will bite you here, no matter what interface you use.
It's very likely you can work around this problem. It depends on your definition of "character", and it might depend on how precise the jump needs to be. You haven't provided enough information for us to help you.
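One possible workaround, sketched here under assumptions the question doesn't confirm: a "character" is a single byte, and the file is opened as a binary stream ("rb"), so offsets are byte offsets and a CRLF pair counts as two bytes on every platform. The helper names byte_at and matches_ending_at are hypothetical, and the code is not tuned for speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: return the byte at offset pos in a binary stream,
   or EOF if pos is out of range or the seek fails. */
int byte_at(FILE *fp, long pos)
{
    assert(fp != NULL);
    if (pos < 0 || fseek(fp, pos, SEEK_SET) != 0) {
        return EOF;
    }
    return getc(fp);
}

/* Hypothetical helper: compare pat (len bytes) against the text window
   whose last byte is at offset end, scanning right to left as a
   Boyer-Moore style search does. Returns 1 on a match, 0 otherwise. */
int matches_ending_at(FILE *fp, long end, const char *pat, size_t len)
{
    assert(fp != NULL && pat != NULL);
    if (len == 0) {
        return 1;
    }
    if (end < 0 || len - 1 > (unsigned long)end) {
        return 0;                   /* window would start before offset 0 */
    }
    for (size_t i = 0; i < len; i++) {
        int c = byte_at(fp, end - (long)i);
        if (c == EOF || c != (unsigned char)pat[len - 1 - i]) {
            return 0;
        }
    }
    return 1;
}
```

With a file containing "hello\r\nworld", offsets 5 and 6 hold '\r' and '\n' as separate bytes, so a precomputed jump lands on the same byte regardless of platform. Whether that matches the search's definition of a character is a separate question.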
Upvotes: 3