Reputation: 8308
Typically fscanf, when scanning a non-integer using %d, will fail until the non-integer characters are explicitly removed from the input stream. Trying to scan a123 fails until the a is removed from the input stream.
Trying to scan ------123 fails (fscanf returns 0) but the - is removed from the input stream.
Is this correct behavior for fscanf?
The file contains ----------123 and the result of this code:
#include <stdio.h>

int main(void) {
    int number = 0;
    int result = 0;
    FILE *pf = NULL;

    if (NULL != (pf = fopen("integer.txt", "r"))) {
        while (1) {
            if (1 == (result = fscanf(pf, "%d", &number))) {
                printf("%d\n", number);
            } else {
                if (EOF == result) {
                    break;
                }
                printf("result is %d\n", result);
            }
        }
        fclose(pf);
    }
    return 0;
}
is:
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
-123
If the file contains a123 the result is an infinite loop.
That seems to me to be inconsistent behavior. No?
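For reference, the "explicitly removed" step mentioned above could look like the sketch below. The helper name next_int is hypothetical, and discarding exactly one character per failed conversion is an assumption about what the caller wants.

```c
#include <stdio.h>

/* Hypothetical helper: read the next int from fp, discarding one
 * character after every matching failure.
 * Returns 1 on success, EOF on end of input or read error. */
int next_int(FILE *fp, int *number)
{
    int result;

    while ((result = fscanf(fp, "%d", number)) == 0) {
        /* Matching failure: remove one offending character, then retry. */
        if (fgetc(fp) == EOF)
            return EOF;
    }
    return result;
}
```

With a123 in the file this yields 123 instead of looping forever. With input such as ----------123 the value obtained may depend on whether the library consumes a - during the failed conversion.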
Upvotes: 5
Views: 240
Reputation: 70411
The point here is not one of inconsistency, but one of the many limitations of the fscanf() family.
The standard is very specific on how fscanf() parses input. Characters are taken from input one by one, and checked against the format string. If they match, the next character is taken from input. If they don't match, the character is "put back", and the conversion fails.
But only that last character read is ever put back.
C11 7.21.6.2 The fscanf function, paragraph 9 (emphasis mine):
An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence.285) The first character, if any, after the input item remains unread.
285) fscanf pushes back at most one input character onto the input stream. Therefore, some sequences that are acceptable to strtod, strtol, etc., are unacceptable to fscanf.
This one character of push-back has nothing to do with the one character of push-back that ungetc() guarantees; it is independent of and in addition to that. (A user could have fscanf() fail, then ungetc() a character, and expect the ungetc()'d character to come out of input, followed by the character pushed back by the failed fscanf(). A library function may not call ungetc(), which is reserved to the user.)
This makes implementing the scanning in fscanf() somewhat easier, but it also makes fscanf() fail in the middle of certain character sequences without retracing to where it began its conversion attempt.
In your case, "--123" read as "%d":
- Read '-'. Sign. All is well, continue.
- Read '-'. Matching error.
- Put back '-'. Only this last character read can be put back; the first '-' stays consumed, as per above.
- Return 0 (conversion failed).
This is (one of) the reason(s) why you should never use *scanf() on potentially malformed input: the scan can fail without you knowing where exactly it failed, and without properly rolling back.
It's also a murky corner of the standard that was not actually implemented correctly in a number of mainstream library implementations last time I checked. (And not when I re-checked just now.) ;-)
Other reasons not to use fscanf() on potentially malformed input include, but are not limited to, numerical overflow, which is not handled gracefully at all.
The intended use of fscanf() is to scan known well-formatted data, ideally data that has been written by that same program using fprintf(). It is not well suited to parsing user input.
Hence the usual recommendation is to read full lines of input with fgets(), then parse the line in memory using strtol(), strtod() etc., which can and will handle things like the above in a well-defined way.
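A minimal sketch of that recommendation follows. The helper name parse_int_line and the choice to reject trailing garbage are assumptions made for illustration, not part of the standard library.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse one int from a line that was already read
 * with fgets(). Returns 1 on success, 0 if the line is not a valid int. */
int parse_int_line(const char *line, int *out)
{
    char *end = NULL;
    long value;

    errno = 0;
    value = strtol(line, &end, 10);

    if (end == line)
        return 0;               /* no conversion, e.g. "a123" or "--123" */
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return 0;               /* does not fit in an int */

    /* Accept only trailing white space after the number. */
    while (*end == ' ' || *end == '\t' || *end == '\n' || *end == '\r')
        end++;
    if (*end != '\0')
        return 0;

    *out = (int)value;
    return 1;
}
```

A caller would typically fill a buffer with fgets(line, sizeof line, pf) and pass it to this helper. Because the whole line is in memory, a failed parse consumes nothing from the stream, and the end pointer shows exactly where parsing stopped.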
Upvotes: 7
Reputation: 145317
This behavior is specified:
Here are the relevant paragraphs from the C2x Standard:
7.21.6.2 The fscanf function
[...]
7 A directive that is a conversion specification defines a set of matching input sequences, as described below for each specifier. A conversion specification is executed in the following steps:
8 Input white-space characters are skipped, unless the specification includes a [, c, or n specifier.
9 An input item is read from the stream, unless the specification includes an n specifier. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence.310) The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.
10 Except in the case of a % specifier, the input item (or, in the case of a %n directive, the count of input characters) is converted to a type appropriate to the conversion specifier. If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure. Unless assignment suppression was indicated by a *, the result of the conversion is placed in the object pointed to by the first argument following the format argument that has not already received a conversion result. If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined.
310) fscanf pushes back at most one input character onto the input stream. Therefore, some sequences that are acceptable to strtod, strtol, etc., are unacceptable to fscanf.
In your example, the initial - is a prefix of a matching input sequence, and the next character, another -, does not match, so it remains in the input stream. The input item, -, is not a matching sequence, so you get a matching failure and 0 is returned, but the first - was consumed.
This behavior is observed on Linux with the GNU libc, but not on macOS with Apple Libc, where the initial dash is not consumed.
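To check which behavior a given library has, a small probe along these lines may help. The function name probe_double_dash is hypothetical, and the sketch assumes tmpfile() is usable in the environment.

```c
#include <stdio.h>

/* Hypothetical probe: write "--123" to a temporary file, attempt a %d
 * conversion, and report how many bytes the failed attempt consumed.
 * Returns the fscanf() result, or EOF if no temporary file is available. */
int probe_double_dash(long *consumed)
{
    int number = 0;
    int result;
    FILE *fp = tmpfile();

    *consumed = -1;
    if (fp == NULL)
        return EOF;

    fputs("--123", fp);
    rewind(fp);

    result = fscanf(fp, "%d", &number);
    *consumed = ftell(fp);      /* stream position after the attempt */

    fclose(fp);
    return result;
}
```

A consumed count of 1 corresponds to the behavior described in the quoted paragraphs (the first - is gone), while 0 means the library left the initial dash in the stream.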
Upvotes: 2
Reputation: 41065
Is this correct behavior for fscanf?
Yes, it is. As pointed out by @stark in the comments, - is part of the result when you use %d as the format specifier.
If you want to scan a positive integer (only digits), you can use a pattern in fscanf to discard all non-digits:
fscanf(pf, "%*[^0-9]%d", &number)
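One caveat: a scanset directive must match at least one character, so the single-call form returns 0 when the next character is already a digit. A possible workaround, sketched here with a hypothetical helper named scan_digits, is to do the skipping in a separate call:

```c
#include <stdio.h>

/* Hypothetical helper: skip any run of non-digits, then read an int.
 * Returns 1 on success, EOF at end of input. */
int scan_digits(FILE *fp, int *number)
{
    /* This directive fails when the next character is a digit. Running it
     * in its own call keeps that failure from cancelling the %d below,
     * so its return value is deliberately ignored. */
    (void)fscanf(fp, "%*[^0-9]");

    return fscanf(fp, "%d", number);
}
```

Called in a loop, this reads 123 from ----------123 and also handles a file that starts directly with digits. Note that the sign is discarded along with the other non-digits, so negative numbers come back positive.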
Upvotes: 2