Reputation: 743
I'm rewriting a C++ parser in Rust for a legacy ASCII data format. Real number values in this format are permitted to be stored in any Fortran recognized format. Unfortunately, Fortran recognizes some formats not recognized by Rust (or most other languages). For example, the value 101.01 might be represented as
The first three are all natively recognized by Rust. The remaining four pose a challenge. In C++, we use the following routine to parse these values:
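#include <cctype>
#include <cmath>
#include <cstdlib>
#include <string>

double parse(const std::string& s) {
    char* p;
    // strtod consumes the significand and leaves p at the first unparsed character.
    const double significand = strtod(s.c_str(), &p);
    // Skip a one-letter exponent marker if present; strtol reads the sign itself.
    const long exponent = (*p == '\0') ? 0
        : isalpha(static_cast<unsigned char>(*p)) ? strtol(p + 1, nullptr, 10)
        : strtol(p, nullptr, 10);
    return significand * pow(10, exponent);
}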
After reviewing the Rust documentation, it doesn't appear that the standard library offers partial string parsing in the vein of strtod and strtol. I'd like to avoid taking multiple passes over the string or using regular expressions for performance reasons.
Upvotes: 4
Views: 518
Reputation: 60207
Your example in C++ does not give perfectly accurate results, but Rust's float parsing is intended to be perfectly accurate, and as such has slower parsing than you might need.
If you implement approximate parsing manually, it will likely come out faster than any other technique available. A quick test I did locally suggests you can easily get a factor of 5 over the performance of the standard library's parse method.
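As a rough illustration of what such a manual parser could look like, here is a minimal single-pass sketch. The function name parse_approx is hypothetical, and the accepted exponent forms (an e, E, d or D marker, or a bare sign) are assumed from the forms discussed in this thread. It is not correctly rounded, and it makes no claim about the speedup quoted above.

```rust
/// Approximate single-pass parser for Fortran-style reals (hypothetical helper).
/// Accepts an optional sign, digits with an optional '.', and an optional
/// exponent introduced by e/E/d/D and/or a bare sign. Not correctly rounded.
fn parse_approx(s: &str) -> Option<f64> {
    let b = s.trim().as_bytes();
    let mut i = 0;

    // Optional sign of the significand.
    let negative = match b.first() {
        Some(b'-') => { i += 1; true }
        Some(b'+') => { i += 1; false }
        _ => false,
    };

    // Accumulate digits as f64 and count the digits after the decimal point.
    let mut mantissa = 0.0_f64;
    let mut frac_digits = 0_i32;
    let mut seen_point = false;
    let mut seen_digit = false;
    while i < b.len() {
        match b[i] {
            d @ b'0'..=b'9' => {
                mantissa = mantissa * 10.0 + f64::from(d - b'0');
                if seen_point { frac_digits += 1; }
                seen_digit = true;
            }
            b'.' if !seen_point => seen_point = true,
            _ => break,
        }
        i += 1;
    }
    if !seen_digit { return None; }

    // Optional exponent: marker letter, then optional sign, then digits.
    let mut exponent = 0_i32;
    if i < b.len() {
        if matches!(b[i], b'e' | b'E' | b'd' | b'D') { i += 1; }
        let exp_negative = match b.get(i) {
            Some(b'-') => { i += 1; true }
            Some(b'+') => { i += 1; false }
            _ => false,
        };
        let start = i;
        while i < b.len() && b[i].is_ascii_digit() {
            exponent = exponent.saturating_mul(10).saturating_add(i32::from(b[i] - b'0'));
            i += 1;
        }
        // Reject a missing exponent value or trailing garbage.
        if i == start || i != b.len() { return None; }
        if exp_negative { exponent = -exponent; }
    }

    // Fold the fractional digit count into the decimal exponent and scale once.
    let value = mantissa * 10f64.powi(exponent.saturating_sub(frac_digits));
    Some(if negative { -value } else { value })
}
```

Because the result comes from a single multiplication by a power of ten, it can be off in the last bits compared with the standard library, so compare with a tolerance rather than for equality.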
If you would rather have exact parsing, your C++ code is insufficient. A pre-parse (e.g. with Regex) is likely the easiest way to do this, but alternatively you can rip out the code from the standard library and modify that.
Upvotes: 2
Reputation:
This would have been a comment to Veedrac's answer, but it got a bit long for a comment.
As Veedrac explains, parsing floats accurately is hard. The implementation in the standard library is completely accurate and reasonably well optimized. In particular, it's not much slower than the naive inaccurate algorithm for most inputs where the naive algorithm works. You should use it. Full disclosure: I wrote it.
Where I disagree with Veedrac is how to proceed if you want to reuse that code. Ripping it out from the standard library is a bad idea. It's huge, about 2.5k lines of code, and it still changes and is improved occasionally, although rarely and mostly in very minor ways. But one day I'll find the time to rewrite the slow path to be better and faster, promised. If you rip out this code, you would have to take the core::num::dec2flt module and modify the parse submodule to recognize other exponents. Of course then you won't automatically benefit from future improvements, which is a shame if you're interested in performance.
The sanest way would be to translate the other formats to the format supported by Rust. If it's a d, D or a bare + you can simply replace it with an e and pass the resulting string on to parse. Only in a case like 1010.1-1 will you need to insert an e and shift the exponent part of the string. This should not cost much performance. Float strings are short (at most 20 or so bytes, often much less) and the actual conversion does a good chunk of work per byte. This is true for your C++ code as well, because strtod is accurate in glibc too. Or at least it tries to be; it can't fix the ad hoc algorithm built around it.
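The translation described above could be sketched roughly as follows. The name parse_fortran_real is hypothetical, and the sketch assumes the only deviations from Rust's syntax are a d/D marker or a bare sign introducing the exponent. It builds one small temporary String and then defers to the standard library, so the result stays correctly rounded.

```rust
/// Hypothetical helper: rewrite a Fortran-style real into the form accepted by
/// `str::parse::<f64>` and parse it with the standard library.
fn parse_fortran_real(s: &str) -> Result<f64, std::num::ParseFloatError> {
    let s = s.trim();
    let mut out = String::with_capacity(s.len() + 1);
    let mut has_marker = false;
    for (i, c) in s.char_indices() {
        match c {
            // The first exponent marker (d, D, e, E) becomes a plain 'e'.
            'd' | 'D' | 'e' | 'E' if !has_marker => {
                has_marker = true;
                out.push('e');
            }
            // A sign that is neither leading nor preceded by a marker starts
            // a bare exponent: insert the missing 'e' in front of it.
            '+' | '-' if i > 0 && !has_marker => {
                has_marker = true;
                out.push('e');
                if c == '-' {
                    out.push('-');
                }
            }
            // Everything else is copied through; invalid input is left for
            // the standard parser to reject.
            _ => out.push(c),
        }
    }
    out.parse()
}
```

For example, 1.0101D+2 is rewritten to 1.0101e+2 and 1010.1-1 to 1010.1e-1 before parsing.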
Another possibility is to use FFI to call C's strtod. Use the libc crate and call libc::strtod. This requires some contortions to translate from &str to raw pointers to c_char, and it will handle interior NUL bytes badly, but the code you show is not terribly robust anyway. This would allow you to translate your algorithm to Rust with identical performance and semantics and (in)accuracy.
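A minimal sketch of the FFI route, assuming a platform whose C library provides strtod and strtol. To stay self-contained it declares the two functions by hand instead of depending on the libc crate (which exposes them as libc::strtod and libc::strtol). The name parse_via_strtod is hypothetical. The unsafe extern spelling needs Rust 1.82 or later; on older toolchains, drop the unsafe keyword from the block.

```rust
use std::ffi::CString;
use std::os::raw::{c_char, c_int, c_long};

// Hand-written declarations of the C library functions used below.
unsafe extern "C" {
    fn strtod(nptr: *const c_char, endptr: *mut *mut c_char) -> f64;
    fn strtol(nptr: *const c_char, endptr: *mut *mut c_char, base: c_int) -> c_long;
}

/// Hypothetical port of the C++ routine from the question.
/// Returns `None` if `s` contains an interior NUL byte.
fn parse_via_strtod(s: &str) -> Option<f64> {
    // CString appends the terminating NUL that the C functions rely on.
    let c = CString::new(s).ok()?;
    let mut end: *mut c_char = std::ptr::null_mut();
    // SAFETY: `c` is a valid NUL-terminated string that outlives both calls,
    // and after `strtod` returns `end` points inside it (at most at the NUL).
    unsafe {
        let significand = strtod(c.as_ptr(), &mut end);
        let next = *end as u8;
        let exponent: c_long = if next == 0 {
            0
        } else if next.is_ascii_alphabetic() {
            // Skip the one-letter exponent marker, e.g. 'D'.
            strtol(end.add(1), std::ptr::null_mut(), 10)
        } else {
            // Bare sign: strtol consumes it along with the digits.
            strtol(end, std::ptr::null_mut(), 10)
        };
        Some(significand * 10f64.powf(exponent as f64))
    }
}
```

Like the C++ original, this scales by a separately computed power of ten, so the result is not guaranteed to be correctly rounded.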
Upvotes: 5