Tags: python, validation, floating-point, type-conversion

Reputation: 3433

Python: validate whether string is a float without conversion

Is there a Pythonic way to validate whether a string represents a floating-point number (any input that would be recognizable by float(), e.g. -1.6e3), without converting it (and, ideally, without resorting to throwing and catching exceptions)?

Previous questions have been submitted about how to check if a string represents an integer or a float. Answers suggest using try...except clauses together with the int() and float() built-ins, in a user-defined function.

However, these haven't properly addressed the issue of speed. While using the try...except idiom for this ties the conversion process to the validation process (to some extent rightfully), applications that go over a large amount of text for validation purposes (any schema validator, parsers) will suffer from the overhead of performing the actual conversion. Besides the slowdown due to the actual conversion of the number, there is also the slowdown caused by throwing and catching exceptions. This GitHub gist demonstrates how, compared to user-defined validation only, built-in conversion code is twice as costly (compare True cases), and exception handling time (False time minus True time for the try..except version) alone is as much as 7 validations. This answers my question for the case of integer numbers.

Valid answers will be: functions that solve the problem in a more efficient way than the try..except method, a reference to documentation for a built-in feature that will allow this in the future, a reference to a Python package that allows this now (and is more efficient than the try..except method), or an explanation pointing to documentation of why such a solution is not Pythonic, or will otherwise never be implemented. Specifically, to prevent clutter, please avoid answers such as 'No' without pointing to official documentation or mailing-list debate, and avoid reiterating the try..except method.

Upvotes: 1

Answers (4)

Yuval

Reputation: 3433

As @John mentioned in a comment, this appears as an answer in another question, though it is not the accepted answer in that case. Regular expressions and the fastnumbers module are two solutions to this problem.

However, it's duly noted (as @en_Knight did) that performance depends largely on the inputs. If you expect mostly valid inputs, then the EAFP approach is faster, and arguably more elegant. If you don't know what input to expect, then LBYL might be more appropriate. Validation, in essence, should expect mostly valid inputs, so try..except suits it better.

The fact is, for my use case (and as the writer of the question it bears relevance) of identifying types of data in a tabular data file, the try..except method was more appropriate: a column is either all float, or, if it has a non-float value, from that row on it's considered textual, so most of the inputs actually tested for float are valid in either case. I guess all those other answers were on to something.
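The column rule described above can be sketched as follows. This is only an illustration under stated assumptions: `infer_column_type` is a hypothetical helper name, cells are assumed to arrive as strings, and the labels "float" and "text" are placeholders.

```python
def infer_column_type(values):
    """Classify a column of string cells as "float" or "text" (hypothetical helper)."""
    for value in values:
        try:
            float(value)
        except ValueError:
            # First non-float cell: the column is treated as textual from
            # here on, so no later cell needs to be tested.
            return "text"
    # Every cell converted, so the exception path was never taken.
    return "float"
```

With this shape, a column pays for at most one raised exception, which is why the valid-input timing dominates in this use case.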

Back to the answer: fastnumbers and regular expressions are still appealing solutions for the general case. Specifically, the fastnumbers package seems to work well for all values except special ones, such as Infinity, Inf and NaN, as demonstrated in this GitHub gist. The same goes for the simple regular expression from the aforementioned answer (modified slightly: the trailing \b was removed, as it would cause some inputs to fail):

^[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?$

A bulkier version, that does recognize the special values, was used in the gist, and has equal performance:

^[-+]?(?:[Nn][Aa][Nn]|[Ii][Nn][Ff](?:[Ii][Nn][Ii][Tt][Yy])?|(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?)$
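For reference, a minimal sketch of how the two patterns might be used from Python. The names `SIMPLE_FLOAT`, `SPECIAL_FLOAT` and `looks_like_float` are made up for this example; only the pattern strings come from the answer.

```python
import re

# The two patterns quoted above, compiled once so repeated checks reuse them.
SIMPLE_FLOAT = re.compile(
    r"^[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?$"
)
SPECIAL_FLOAT = re.compile(
    r"^[-+]?(?:[Nn][Aa][Nn]|[Ii][Nn][Ff](?:[Ii][Nn][Ii][Tt][Yy])?"
    r"|(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?)$"
)


def looks_like_float(text, pattern=SPECIAL_FLOAT):
    """Return True if the pattern matches; no conversion is performed."""
    return pattern.match(text) is not None
```

Note that neither pattern accepts surrounding whitespace, which float() does tolerate, so the match is slightly stricter than the built-in.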

The regular expression implementation is ~2.8 times slower than try..except on valid inputs, but ~2.2 times faster on invalid inputs. Invalid inputs run ~5 times slower than valid ones using try..except, but ~1.3 times faster than valid ones using regular expressions. Given these results, regular expressions are favorable when 40% or more of the expected inputs are invalid.
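The 40% figure follows from the quoted ratios. Here is a small worked sketch, taking the cost of try..except on a valid input as the unit; the constant and function names are made up for illustration.

```python
# Costs relative to try..except on a valid input (= 1.0), from the quoted ratios.
TRY_VALID = 1.0
TRY_INVALID = 5.0                   # invalid inputs are ~5x slower with try..except
REGEX_VALID = 2.8                   # regex is ~2.8x slower on valid inputs
REGEX_INVALID = TRY_INVALID / 2.2   # regex is ~2.2x faster on invalid inputs


def break_even_invalid_fraction():
    """Fraction p of invalid inputs at which both methods cost the same.

    Solves TRY_VALID*(1-p) + TRY_INVALID*p == REGEX_VALID*(1-p) + REGEX_INVALID*p.
    """
    valid_gap = REGEX_VALID - TRY_VALID        # extra regex cost per valid input
    invalid_gap = TRY_INVALID - REGEX_INVALID  # regex saving per invalid input
    return valid_gap / (valid_gap + invalid_gap)
```

With these numbers the break-even fraction comes out at roughly 0.40, matching the threshold stated above.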

fastnumbers is merely ~1.2 times faster on valid inputs, but ~6.3 times faster on invalid inputs.

Results are described in the plot below. I ran with 10^6 repeats, with 170 valid inputs and 350 invalid inputs (weighted accordingly, so the average time is per a single input). Colors don't show because boxes are too narrow, but the ones on the left of each column describe timings for valid inputs, while invalid inputs are to the right.

NOTE: The answer was edited multiple times to reflect comments on the question, on this answer and on other answers. For clarity, the edits have been merged. Some of the comments refer to previous versions.

Upvotes: 4

en_Knight

Reputation: 5371

Speedy Solution If You're Sure You Want it

Taking a look at this reference implementation, the conversion to float in Python happens in C code and is executed very efficiently. If you really were worried about overhead, you could copy that code verbatim into a custom C extension, but instead of raising the error flag, return a boolean indicating success.

In particular, look at the complicated logic implemented to coerce hex into float. This is done at the C level, with a lot of error cases; it seems highly unlikely there's a shortcut here (note the 40 lines of comments arguing for one particular guarding case), or that any hand-rolled implementation will be faster while preserving these cases.

But... Necessary?

As a hypothetical, this question is interesting, but in the general case you should profile your code to confirm that the try/except method is actually adding overhead. Try/except is often idiomatic and, moreover, can be faster depending on your usage. For example, for-loops in Python rely on an exception (StopIteration) by design to detect the end of iteration.
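To illustrate the for-loop remark, here is a rough sketch of what a for-loop does with the iterator protocol. `manual_for` is a made-up name, and the real loop is implemented in the interpreter rather than in Python code like this.

```python
def manual_for(iterable, body):
    """Approximate `for item in iterable: body(item)` using the iterator protocol."""
    iterator = iter(iterable)
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            # Exhaustion is signalled by an exception, not by a return value.
            break
        body(item)
```

For example, `manual_for([1, 2, 3], print)` prints the three items and then stops when StopIteration is raised.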

Alternatives and Why I Don't Like Them

To clarify, the question asks about

any input that would be recognizable by float()

Alternative #1 -- How about a regex

I find it hard to believe that you will get a regex to solve this problem in general. While a regex will be good at capturing float literals, there are a lot of corner cases. Look at all the cases in this answer: does your regex handle NaN? Exponentials? Bools (but not bool strings)?

Alternative #2: Manually Unrolled Python Check:

To summarize the tough cases that need to be captured (which Python natively does)

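Case-insensitive capturing of NaN
Hex matching (note that float() itself rejects hex strings such as "0x1p3"; float.fromhex parses them)
All of the cases enumerated in the language specification
Signs, including signs in the exponent
Booleans (float(True) works, but the string "True" is rejected)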
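Which of these cases float() accepts for string input can be checked directly. The sketch below uses a made-up helper name, `accepted_by_float`, and a hand-picked sample list. In CPython 3, hex strings, the string "True" and imaginary literals are rejected by float(), while surrounding whitespace is accepted.

```python
def accepted_by_float(text):
    """Return True if float(text) succeeds for the given string."""
    try:
        float(text)
    except ValueError:
        return False
    return True


# A few corner cases; the comments state what float() does with each string.
CASES = [
    "nan",        # accepted
    "NaN",        # accepted (case-insensitive)
    "-Infinity",  # accepted
    "1e-5",       # accepted (sign in the exponent)
    " 1.5\n",     # accepted (surrounding whitespace is stripped)
    "0x1p3",      # rejected (float.fromhex handles hex strings instead)
    "True",       # rejected (only the bool object True converts)
    "1j",         # rejected (imaginary literal)
]
results = {case: accepted_by_float(case) for case in CASES}
```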

I would also point you to the case just below floating-point literals in the language specification: imaginary numbers. float() handles these elegantly by recognizing what they are but raising an error on the conversion (a TypeError for a complex object such as 1j, a ValueError for the string "1j"). Will your custom method emulate that behaviour?

Upvotes: -1

Mr. E

Reputation: 2120

If being Pythonic is the justification, then you should just stick to The Zen of Python. Specifically, to these lines:

Explicit is better than implicit.

Simple is better than complex.

Readability counts.

There should be one-- and preferably only one --obvious way to do it.

If the implementation is hard to explain, it's a bad idea.

All of those are in favour of the try-except approach. The conversion is explicit, simple, readable, obvious and easy to explain.

Also, the only way to know if something is a float number is to test whether it's a float number. This may sound redundant, but it's not.

Now, if the main problem is speed when testing a very large number of supposed float numbers, you could use a C extension written with Cython to test all of them at once. But I don't really think it will give you much improvement in speed unless the number of strings to test is really big.

Edit:

Python developers tend to prefer the EAFP approach (Easier to Ask for Forgiveness than Permission), making the try-except approach more pythonic (I can't find the PEP)

And here (Cost of exception handlers in Python) is a comparison between the try-except approach and the if-then approach. It turns out that exception handling in Python is not as expensive as it is in other languages, and it's only more expensive when an exception must actually be handled. In general use cases you won't be validating strings that have a high probability of not being a float number (unless that is the case in your specific scenario).

Again, as I said in a comment, the entire question doesn't make much sense without a specific use case, data to test and a measure of time. For the most generic use case, try-except is the way to go; if you have an actual need that can't be satisfied fast enough with it, then you should add it to the question.
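One way to measure this on your own machine is a small timeit harness, sketched below. The helper names and the two sample strings are arbitrary choices for the example, and the absolute numbers will vary by interpreter and hardware.

```python
import timeit


def is_float_eafp(text):
    """try-except check: the except branch runs only for invalid input."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def time_per_call(text, repeats=100_000):
    """Average seconds per call of is_float_eafp(text) over `repeats` runs."""
    return timeit.timeit(lambda: is_float_eafp(text), number=repeats) / repeats


valid_cost = time_per_call("-1.6e3")          # no exception raised
invalid_cost = time_per_call("not a number")  # exception raised and caught
```

Comparing `invalid_cost` with `valid_cost` gives the extra cost of the handled-exception path for your own inputs.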

Upvotes: 0

acdr

Reputation: 4706

To prove a point: there aren't that many conditions that a string has to satisfy in order to be float-able. However, checking all those conditions in Python is going to be rather slow.

DIGITS = "0123456789"

def is_float(string):
    # Covers plain decimal notation only; special values (nan, inf),
    # surrounding whitespace and digit-group underscores are not handled.

    # Split off an optional exponent; only one 'e' or 'E' may appear.
    mantissa, e, exponent = string.lower().partition("e")
    if e:
        # The exponent is an optionally signed, non-empty run of digits.
        if exponent[:1] in ("+", "-"):
            exponent = exponent[1:]
        if not exponent or any(char not in DIGITS for char in exponent):
            return False

    # The mantissa may carry a single leading sign.
    if mantissa[:1] in ("+", "-"):
        mantissa = mantissa[1:]

    # At most one decimal point, with at least one digit on either side of it.
    whole, _, fraction = mantissa.partition(".")
    if not (whole or fraction):
        return False
    return all(char in DIGITS for char in whole + fraction)

I didn't actually test this, but I'm willing to bet that this is a lot slower than try: float(string); return True; except Exception: return False

Upvotes: -1
