Reputation: 1536
Lately I've stumbled across a post, How to read an entire file into memory in C++, which describes different techniques for reading files. Each approach is annotated with comments on its efficiency or its risk of undefined behaviour. Among them, the following example is presented:
// Bad code; undefined behaviour
in.seekg(0, std::ios_base::end);
This call, in this or a similar form, is often used to find the file size.
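For reference, the idiom in question usually looks something like the sketch below. The helper name `size_by_seeking` is hypothetical and not taken from the cited post; the code only illustrates the pattern being discussed, and whether the reported value is meaningful on a given platform is exactly the open question.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper (not from the cited post): returns the position
// reported after seeking to the end of the file, or -1 on failure.
std::streamoff size_by_seeking(const std::string& path) {
    std::ifstream in(path, std::ios_base::binary);
    if (!in) return -1;                      // could not open the file
    in.seekg(0, std::ios_base::end);         // the call the post flags
    const std::streampos end = in.tellg();   // position at the "end"
    in.seekg(0, std::ios_base::beg);         // rewind for later reads
    if (!in) return -1;                      // a seek failed
    return static_cast<std::streamoff>(end);
}
```

On common desktop platforms this tends to return the byte count of a regular file, which is presumably why the idiom is so widespread.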
The reasoning presented in the post is, in short, that the C standard (N1570), §7.21.3, states:
Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state.
This is footnote 268, attached to:
A file need not begin nor end in the initial shift state
To confirm the above for C++11, the post additionally references the C++ standard draft (N3242), 27.9.1.1, which states:
The restrictions on reading and writing a sequence controlled by an object of class basic_filebuf are the same as for reading and writing with the Standard C library FILEs.
Here basic_filebuf, according to cppreference, is part of the implementation of basic_ifstream (its internal buffer), which suggests that an ifstream implementation is subject to the same restrictions.
From what I've understood from the description and from what I've managed to dig up, this issue is mostly related to wide-oriented streams, which may not end in the initial shift state.
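As a side note, the orientation of a C stream can at least be inspected with `fwide`. The following is only a minimal sketch; the helper name `orientation_of` is hypothetical, and it says nothing about shift states as such.

```cpp
#include <cstdio>
#include <cwchar>

// Hypothetical helper: returns a value > 0 for a wide-oriented stream,
// < 0 for a byte-oriented stream, and 0 if no orientation is set yet.
int orientation_of(std::FILE* stream) {
    // Mode 0 only queries the current orientation; it does not change it.
    return std::fwide(stream, 0);
}
```

A freshly opened stream should report 0 here, and the first byte or wide-character I/O call then fixes the orientation.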
It seems to me this cannot be a typical case, given how popular this idiom is for calculating file size. Still, the topic is not quite clear to me. Hence the following questions:
Initial shift state? I assume it cannot be related to data clusters, but rather to multibyte character encoding; in that case, though, wouldn't the problem be limited to non-binary streams?
Wide- and narrow-oriented streams? I am aware that "A newly opened stream has no orientation." and that the orientation is decided on the first I/O call to the stream. But in practice are there, let's say, any defaults dependent on stream type, system, locale or something else?
Upvotes: 3
Views: 346
Reputation: 3992
"Shift state" is indeed limited to multibyte text streams (plus the EOL treatment, \r\n vs \n), so that part of the problem is limited to text streams.
But that is not the only problem. From the article you cited, with my emphasis:
Some platforms store files as fixed-size records. If the file is shorter than the record size, the rest of the block is padded. When you seek to the “end”, for efficiency’s sake it just jumps you right to the end of the last block… possibly long after the actual end of the data, after a bunch of padding.
fseek(p_file, 0, SEEK_END) followed by ftell(...) affords a valid answer only as long as the EOF flag is not raised.
Read the "The solution (really large files)" section of the cited article, as it provides the details, particularly:
step 4. "Restore the stream to the start position with seekg(). This will also clear the EOF flag."
Question in comments:
do you have the knowledge which platforms particularly?
Google landed me on this list of record-oriented file systems - mainly mainframes, some of which are still used.
Another area which may be a "record files" area: "the cloud". You never know when somebody is going to (re)incarnate Distributed Data Management Architecture and hit problems solvable by record-oriented files. For all I know (which is next to nothing), NFS may be already doing it: the RFC speaks about "record lock".
All things considered, I'd pay attention to the standard and treat this matter with respect when writing "truly standard, cross-platform compatible software" in C/C++.
Upvotes: 3