Parsing floats in Rust from Fortran formats

Question

I'm rewriting a C++ parser in Rust for a legacy ASCII data format. Real values in this format may be stored in any format that Fortran recognizes. Unfortunately, Fortran recognizes some formats that Rust (and most other languages) do not. For example, the value 101.01 might be represented as

101.01
1.0101E2
101.01e0
101.01D0
101.01d0
101.01+0
1010.1-1

The first three are all natively recognized by Rust. The remaining four pose a challenge. In C++, we use the following routine to parse these values:

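#include <cctype>
#include <cmath>
#include <cstdlib>
#include <string>

double parse(const std::string& s){
  char* p;
  const double significand = std::strtod(s.c_str(), &p);
  // p now points at the first character strtod did not consume.
  const long exponent = (*p == '\0') ?
                          0 : std::isalpha(static_cast<unsigned char>(*p)) ?
                            std::strtol(p+1, nullptr, 10) :  // skip the D/d marker
                              std::strtol(p, nullptr, 10);   // bare signed exponent
  return significand * std::pow(10, exponent);
}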

After reviewing the Rust documentation, it doesn't appear that the standard library offers partial string parsing in the vein of strtod and strtol. I'd like to avoid taking multiple passes over the string or using regular expressions for performance reasons.

user395760 · Accepted Answer

This would have been a comment to Veedrac's answer, but it got a bit long for a comment.

As Veedrac explains, parsing floats accurately is hard. The implementation in the standard library is completely accurate and reasonably well optimized. In particular, it's not much slower than the naive inaccurate algorithm for most inputs where the naive algorithm works. You should use it. Full disclaimer: I wrote it.

Where I disagree with Veedrac is how to proceed if you want to reuse that code. Ripping it out from the standard library is a bad idea. It's huge, about 2.5k lines of code, and it still changes/is improved occasionally — although rarely and mostly in very minor ways. But one day I'll find the time to rewrite the slow path to be better and faster, promised. If you rip out this code, you would have to take the core::num::dec2flt module and modify the parse submodule to recognize other exponents. Of course then you won't automatically benefit from future improvements, which is a shame if you're interested in performance.

The sanest way would be to translate the other formats to the format supported by Rust. If the exponent marker is a d, D or a bare +, you can simply replace it with an e and pass the result on to str::parse. Only for a bare -, as in 1010.1-1, do you need to insert an e and shift the exponent part of the string. This should not cost much performance. Float strings are short (at most 20 or so bytes, often much less) and the actual conversion does a good chunk of work per byte. This is true for your C++ code as well, because strtod in glibc is accurate too. Or at least it is trying to be; it can't fix the ad hoc algorithm built around it.
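A minimal sketch of that translation, assuming the input is a single ASCII token with at most one exponent marker. The name parse_fortran_real is made up for illustration and is not a standard library function. For simplicity it inserts an e in front of either bare sign instead of overwriting the +, which works because Rust accepts e+0 as well as e0. The format! call allocates a short temporary string; if that shows up in a profile, a fixed-size stack buffer would be the next thing to try.

```rust
use std::num::ParseFloatError;

/// Parse a Fortran-style real such as `101.01D0` or `1010.1-1`.
/// Hypothetical helper: rewrites the exponent marker, then defers to
/// the standard library's float parser.
fn parse_fortran_real(s: &str) -> Result<f64, ParseFloatError> {
    let s = s.trim();
    let bytes = s.as_bytes();
    // Start at index 1 so a leading sign on the significand is not
    // mistaken for a bare exponent sign.
    for i in 1..bytes.len() {
        match bytes[i] {
            // Already in a form Rust understands.
            b'e' | b'E' => return s.parse(),
            // `d`/`D` marker: swap it for `e`.
            b'd' | b'D' => {
                return format!("{}e{}", &s[..i], &s[i + 1..]).parse();
            }
            // Bare sign: insert `e` in front of it.
            b'+' | b'-' => {
                return format!("{}e{}", &s[..i], &s[i..]).parse();
            }
            _ => {}
        }
    }
    // No exponent part at all.
    s.parse()
}
```

With this, all seven spellings from the question should come out as the same f64, since the final conversion is always done by the standard library on a string like 1010.1e-1.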

Another possibility is to use FFI to call C's strtod. Use the libc crate and call libc::strtod. This requires some contortions to translate from &str to raw pointers to c_char, and it will handle interior 0 bytes badly, but the code you show is not terribly robust anyway. This would allow you to translate your algorithm to Rust with identical performance and semantics and (in)accuracy.
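A rough sketch of the FFI route, assuming a platform whose C library provides strtod and strtol (any hosted target should). To keep it self-contained it declares the two functions by hand; with the libc crate you would call libc::strtod and libc::strtol instead. The name parse_via_strtod is made up for illustration. It mirrors the C++ routine from the question, including its inaccuracy and its lack of error checking, and only adds a rejection of interior NUL bytes.

```rust
use std::ffi::CString;
use std::os::raw::{c_char, c_double, c_int, c_long};

// Hand-written declarations of the C standard library functions.
unsafe extern "C" {
    fn strtod(nptr: *const c_char, endptr: *mut *mut c_char) -> c_double;
    fn strtol(nptr: *const c_char, endptr: *mut *mut c_char, base: c_int) -> c_long;
}

/// Hypothetical port of the C++ `parse` routine.
/// Returns `None` only if `s` contains an interior NUL byte.
fn parse_via_strtod(s: &str) -> Option<f64> {
    // Copy into a NUL-terminated buffer that C can read.
    let c = CString::new(s).ok()?;
    let mut p: *mut c_char = std::ptr::null_mut();
    // SAFETY: `c` is a valid NUL-terminated string that outlives both
    // calls, and strtod leaves `p` pointing inside that buffer.
    unsafe {
        let significand = strtod(c.as_ptr(), &mut p);
        let exponent: c_long = match *p as u8 {
            // strtod consumed everything.
            0 => 0,
            // Skip a one-letter exponent marker such as `D`.
            b if b.is_ascii_alphabetic() => strtol(p.add(1), std::ptr::null_mut(), 10),
            // Bare signed exponent.
            _ => strtol(p, std::ptr::null_mut(), 10),
        };
        Some(significand * 10f64.powf(exponent as f64))
    }
}
```

Because the result is a product of two rounded values, 1010.1-1 is not guaranteed to give exactly the same bits as 101.01, which is the inaccuracy mentioned above.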
