Reputation: 8308

Is this fscanf behavior inconsistent?

Typically fscanf, when scanning a non-integer using %d, will fail until the non-integer characters are explicitly removed from the input stream. Trying to scan a123 fails, until the a is removed from the input stream.

Trying to scan ------123 fails (fscanf returns 0) but the - is removed from the input stream.

Is this correct behavior for fscanf?

The file contains ----------123 and the result of this code:

#include <stdio.h>

int main(void) {
    int number = 0;
    int result = 0;
    FILE *pf = NULL;

    if (NULL != (pf = fopen("integer.txt", "r"))) {
        while (1) {
            if (1 == (result = fscanf(pf, "%d", &number))) {
                printf("%d\n", number);
            } else {
                if (EOF == result) {
                    break;
                }
                printf("result is %d\n", result);
            }
        }
        fclose(pf);
    }
    return 0;
}

is:

result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
-123

If the file contains a123 the result is an infinite loop.

That seems to me to be inconsistent behavior. No?

Upvotes: 5

Answers (3)

DevSolar

Reputation: 70411

The point here is not one of inconsistency, but one of the many limitations of the fscanf() family.

The standard is very specific on how fscanf() parses input. Characters are taken from input one by one, and checked against the format string. If they match, the next character is taken from input. If they don't match, the character is "put back", and the conversion fails.

But only that last character read is ever put back.

C11 7.21.6.2 The fscanf function, paragraph 9 (emphasis mine):

An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence. (footnote 285) The first character, if any, after the input item remains unread.

Footnote 285: fscanf pushes back at most one input character onto the input stream. Therefore, some sequences that are acceptable to strtod, strtol, etc., are unacceptable to fscanf.

This one character of push-back has nothing to do with the one character of push-back that ungetc() guarantees; it is independent of and in addition to that. (A user could have fscanf() fail, then ungetc() a character, and expect to read back the ungetc()'d character first, followed by the character pushed back by the failed fscanf(). A library function may not call ungetc(), which is reserved for the user.)

This makes implementing the scanning in fscanf() somewhat easier, but it also makes fscanf() fail in the middle of certain character sequences without backtracking to where it began its conversion attempt.

In your case, "--123" read as "%d":

1. Take the first '-'. It is a valid sign, so all is well; continue.
2. Take the second '-'. Matching error.
3. Put back the second '-'. The first '-' cannot be put back as well, as per above.
4. Return 0 (conversion failed).

This is (one of) the reason(s) why you should not ever use *scanf() on potentially malformed input: The scan can fail without you knowing where exactly it failed, and without properly rolling back.

It's also a murky corner of the standard that was not actually implemented correctly in a number of mainstream library implementations last time I checked. (And not when I re-checked just now.) ;-)

Other reasons not to use fscanf() on potentially malformed input include, but are not limited to, numeric overflow: if the converted value cannot be represented in the target object, the behavior is undefined rather than reported as an error.

The intended use of fscanf() is to scan known well-formatted data, ideally data that has been written by that same program using fprintf(). It is not well-suited to parse user input.

Hence the usual recommendation is to read full lines of input with fgets(), then parse the line in-memory using strtol(), strtod() etc., which can and will handle things like the above in a well-defined way.

Upvotes: 7

chqrlie

Reputation: 145317

This behavior is specified:

Here are the relevant paragraphs from the C2x Standard:

7.21.6.2 The fscanf function

[...]

7. A directive that is a conversion specification defines a set of matching input sequences, as described below for each specifier. A conversion specification is executed in the following steps:
8. Input white-space characters are skipped, unless the specification includes a [, c, or n specifier.
9. An input item is read from the stream, unless the specification includes an n specifier. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence. (footnote 310) The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.
10. Except in the case of a % specifier, the input item (or, in the case of a %n directive, the count of input characters) is converted to a type appropriate to the conversion specifier. If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure. Unless assignment suppression was indicated by a *, the result of the conversion is placed in the object pointed to by the first argument following the format argument that has not already received a conversion result. If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined.

Footnote 310: fscanf pushes back at most one input character onto the input stream. Therefore, some sequences that are acceptable to strtod, strtol, etc., are unacceptable to fscanf.

In your example, the initial - is a prefix of a matching input sequence, and the next character, another -, does not match, so it remains in the input stream. The input item, -, is not itself a matching sequence, so you get a matching failure and fscanf returns 0, but the first - has been consumed.

This behavior is observed on Linux with the GNU C library (glibc), but not on macOS with Apple Libc, where the initial dash is not consumed.

Upvotes: 2

David Ranieri

Reputation: 41065

Is this correct behavior for fscanf?

Yes, it is. As pointed out by @stark in the comments, a leading - is a valid part of the input for the %d format specifier.

If you want to scan a positive integer (digits only), you can use a pattern in fscanf to discard all leading non-digits.

fscanf(pf, "%*[^0-9]%d", &number)

Upvotes: 2
