C++ String Conversion: Exploring std:from_chars in C++17 to C++26

77 points by jandeboevrie 9 months ago

Caution with these functions: in most cases you need to check not only the error code, but also the `ptr` in the result. Otherwise you end up with `to_int("0x1234") == 0` instead of the expected `std::nullopt`, because these functions return success if they matched a number at the beginning of the string.
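To make that concrete, here is a minimal sketch of a wrapper that checks both the error code and `ptr`. The name `to_int` mirrors the one used above; it is a hypothetical helper, not something the standard library provides.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper (not in the standard library): succeeds only when the
// *entire* input is a valid decimal int.
std::optional<int> to_int(std::string_view s) {
    int value = 0;
    const char* first = s.data();
    const char* last = s.data() + s.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{}) return std::nullopt;  // no digits at the start, or out of range
    if (ptr != last) return std::nullopt;        // a number matched, but input was left over
    return value;
}
```

With this, `to_int("1234")` holds 1234, while `to_int("0x1234")` and `to_int("9not a number")` are both empty because `ptr` stops before the end of the input.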

pilif 9 months ago

How can this be the ~5th iteration of a very widespread use case and still contain a footgun?
The API looks like it follows best practice, with a specific result type that also contains specific error information. Yet that's not enough, and you still end up with edge cases where things look fine when they aren't.
- amelius 9 months ago
  
  I suppose the reason is that sometimes you want to parse more than just the number. For example, numbers separated by commas. In that case you will have to call the function repeatedly from your own parsing routine and advance the pointer.
  - gpderetta 9 months ago
    
    Yes, if you are parsing numbers in place in an existing string, you do not necessarily know a priori where the number ends. You could have extra logic to look ahead to the end of the number before passing it to from_chars, but a) it would require an additional pass and b) you could end up replicating part of the implementation of from_chars.
    from_chars is supposed to be a low-level routine. It must be possible to use it correctly, but it is not necessarily the most ergonomic.
- jdbdndj 9 months ago
  
  I don't see a foot gun, you just need to check if you have consumed the whole input. Which is the norm in nearly any streaming api
  - randomNumber7 9 months ago
    
    I didn't see a foot gun either, but somehow my foot went missing.
  - pilif 9 months ago
    
    I get that this is a super low level API, but yet, my expectation about an API that parses a buffer with length to a number and which has a specific enum for error cases as its return type would be that when asked to parse "9not a number" that I would get an error response and not a "the number you asked me to parse is 9 and everything went well" as the result.
    The whole reason for returning a result enum would be so I don't forget to also check somewhere else whether something could possibly have gone wrong.
    Especially when there is a specific result type.
    
    gpderetta 9 months ago
    
    But now you need a different API to compute where a number starts and ends, and that API must use exactly the same definition of number.
    
    pilif 9 months ago
    
    There could be an enum value for "read a partial number"
  - mort96 9 months ago
    
    In what way is it streaming? It takes at least the entire numeric string as a string in memory, you can't stream in more data as needed
- khwrht 9 months ago
  
  That is a good question. The C++ stdlib has some truly bizarre APIs. I wonder if they should freeze std and work on std2.
  - otabdeveloper4 9 months ago
    
    from_chars is the correct API here. When you're parsing decimal numbers you want to do it with streaming semantics.
    
    mort96 9 months ago
    
    Hm but there's nothing streaming about it? You need the entire numeric string in memory
    
    Leherenn 9 months ago
    
    I think they meant the other way around: you can have a single string in memory containing multiple numbers, and process that string as a stream.
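The "call it repeatedly and advance the pointer" usage described in this subthread could look roughly like the sketch below. `parse_csv_ints` is a hypothetical name, and the strict comma-only format is an assumption made to keep the example short.

```cpp
#include <charconv>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical helper: parses "1,2,3" into `out`.
// Returns false as soon as the input stops looking like comma-separated ints.
bool parse_csv_ints(std::string_view s, std::vector<int>& out) {
    const char* p = s.data();
    const char* const end = s.data() + s.size();
    while (true) {
        int value = 0;
        auto [ptr, ec] = std::from_chars(p, end, value);
        if (ec != std::errc{}) return false;  // no number at this position, or out of range
        out.push_back(value);
        if (ptr == end) return true;          // the whole buffer has been consumed
        if (*ptr != ',') return false;        // something other than a separator follows
        p = ptr + 1;                          // skip the comma and parse the next field
    }
}
```

Here the returned `ptr` is exactly what lets the caller continue from where the previous number ended, without a separate pass to find field boundaries.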

captainmuon 9 months ago

I wonder why it is called `from_chars` and not `to_number` or similar. It's obvious what you are converting from, because you have to write down the argument `from_chars(someString)`, but you don't see what is coming out.

Someone 9 months ago

As you indicate, you do see what you’re putting in, and that includes an argument holding a reference to the result of the conversion.
What’s coming out is a std::from_chars_result: a status code plus an indicator how much of the data was consumed.
What to name this depends on how you see this function. As part of a function on number types, from_chars is a good name. As part of a function on strings, to_int/to_long/etc are good names. As freestanding functions, chars_to_int (ugly, IMO), parse_int (with parse sort-of implying taking a string) are options.
I can see why they went for from_chars. Implementations will be more dependent on the output type than on character pointers, it’s more likely there will be more integral types in the future than that there will be a different way to specify character sequences, and it means adding a single function name.
beyondCritics 9 months ago

As a matter of fact, this overloads the name and hence gives less hassle for generic coding.
pavlov 9 months ago

Maybe “number” is too ambiguous because they’d have to define that “in this context a number means a float or integer type only.” The C++ standard also includes support for complex numbers.

vaylian 9 months ago

Example from the website:

  const std::string str { "12345678901234" };
  int value = 0;
  std::from_chars(str.data(),str.data() + str.size(), value);

On the third line: Why can I just pass in `value` like this? Shouldn't I use `&value` to pass in the output variable as a reference?

_gabe_ 9 months ago

I saw another commenter explain that it’s passed by reference, but I agree with you. The C++ Core Guidelines even mention that it’s better to use raw pointers (or pass by value and return a value) in cases like this to make the intent clear.
https://isocpp.org/wiki/faq/references#call-by-reference
- nmeofthestate 9 months ago
  
  A pointer parameter can be null and it doesn't make sense for this parameter to be null, so IMO a reference is the better choice here.
  A non-const reference is just as clear a signal that the parameter may be modified as a non-const pointer. If there's no modification const ref should be used.
  - jdashg 9 months ago
    
    It's about clarity of intent at the call site. Passing by mutable ref looks like `foo`, same as passing by value, but passing mutability of a value by pointer is textually readably different: `&foo`. That's the purpose of the pass by pointer style.
    You could choose to textually "tag" passing by mutable ref by passing `&foo` but this can rub people the wrong way, just like chaining pointer outvars with `&out_foo`.
    
    jcelerier 9 months ago
    
    If you want clarity of intent define dummy in and out macros but please don't make clean reference-taking APIs a mess by turning them into pointers for no good reason
  - gpderetta 9 months ago
    
    In theory a from_chars with an optional output parameter could be useful to validate that the next field is a number and/or discard it without needing to parse it; it might even be worth optimizing for that case.
- cornstalks 9 months ago
  
  nit: I don't think the Core Guidelines actually suggests it's better. It's just "one style." There are pros and cons to both styles.
hardlianotion 9 months ago

In the function signature, value is a reference. You can think of reference as being a pointer that points to a fixed location but with value semantics.
So you can dip value into the function call and the function can assign to it, as it could to data pointed to by a pointer.
trealira 9 months ago
&value would be a pointer to that integer. Instead, it's using references, which also use the & character. References use different syntax, but they're like pointers that can't be reassigned. Examples:
```
  int x = 0;
  int &r = x;
  r += 1;
  assert(x == 1);
  int y;
  r = y; // won't compile

  void inc(int &r) { r += 1; }
  int x = 0;
  inc(x);
  assert(x == 1);
```
The equivalent using pointers, like in C:
```
  int x = 0;
  int *p = &x;
  *p += 1;
  assert(x == 1);
  int y;
  p = &y;

  void inc(int *p) { *p += 1; }
  int x = 0;
  inc(&x);
  assert(x == 1);
```
- trealira 9 months ago
  Mistake:
  r = y; // won't compile
  This will compile. It will be effectively the same as "x = y". The pointer equivalent is "*p = y".
  My apologies, as it's been a while since I've used C++.

Dwedit 9 months ago

What if you try to convert a French floating point number that uses a comma instead of a dot?

jeffbee 9 months ago

Then the programmer has made a mistake, because the behavior is the same as `strtod` in C locale, i.e. it stops parsing at the first comma.
You should think of `from_chars` as a function that accepts the outputs of `to_chars`, not as a general text understander.
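A small sketch of that behavior, assuming a standard library that ships the floating-point overloads of `from_chars` (older library versions only have the integer ones). `Parsed` and `parse_double_prefix` are hypothetical names used for illustration.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Hypothetical result type: the parsed value plus how many characters were used.
struct Parsed {
    double value;
    std::size_t consumed;
    bool ok;
};

// Parses a double from the start of `s` and reports where parsing stopped.
Parsed parse_double_prefix(std::string_view s) {
    double value = 0.0;
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), value);
    return {value, static_cast<std::size_t>(ptr - s.data()), ec == std::errc{}};
}
```

For the input "3,14" this reports success with a value of 3.0 and only one character consumed, so the caller has to notice the leftover ",14" on their own.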

einpoklum 9 months ago

Note that this won't work (AFAICT) with Unicode strings and non-western-arabic digits, e.g.:

    std::u8string_view chars{ u8"۱۲۳٤" };
    int value;
    enum { digit_base = 10 };
    auto [ptr, ec] = std::from_chars(
       chars.data(), chars.data() + chars.size(), value, digit_base);
    return (ec == std::errc{}) ? value : -1;

will fail to compile due to pointer incompatibility.
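As a rough illustration of the runtime side: if the same digits are handed over as plain `char` bytes, the call compiles, but as far as I can tell it reports `invalid_argument` because only ASCII digits are recognized. `status_for` and `eastern_digits` are hypothetical names; the byte string is the UTF-8 encoding of ۱۲۳ written out explicitly.

```cpp
#include <charconv>
#include <string_view>
#include <system_error>

// UTF-8 bytes for the Eastern Arabic-Indic digits ۱۲۳, spelled out so the
// example does not depend on the source file or literal encoding.
constexpr std::string_view eastern_digits = "\xDB\xB1\xDB\xB2\xDB\xB3";

// Returns the error code from_chars reports when parsing `bytes` as an int.
std::errc status_for(std::string_view bytes) {
    int value = 0;
    return std::from_chars(bytes.data(), bytes.data() + bytes.size(), value).ec;
}
```

`status_for(eastern_digits)` yields `std::errc::invalid_argument`, whereas `status_for("123")` yields a value-initialized `std::errc`, i.e. success.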

jcelerier 9 months ago

std::from_chars / std::to_chars are explicitly made to only operate in the C locale, so basically ASCII texts. It's not meant for parsing user-input strings but rather text protocols with maximal efficiency (and locale support prevents that).
E.g. "۱۲۳٤" isn't as far as I know a valid number to put in a json file, http content-length or CSS property. Maybe it'd be ok in a CSV but realistically have you ever seen a big data csv encoded in non-ascii numbers?
- gpderetta 9 months ago
  
  This way your spreadsheet could interpret the data it loads from a CSV differently depending on the locale of the machine it is running on.
  Of course nobody would design an application like that.
- einpoklum 9 months ago
  
  > have you ever seen a big data csv encoded in non-ascii numbers?
  Let me ask you this: How much data have you processed which comes from human user input in Arabic-speaking countries?
  - jcelerier 9 months ago
    
    If your CSV is defined to contain straight, unparsed user input it's wrong no matter the context. If it's defined to contain numbers then if at some point between [user input] and [csv output] you don't have a pass where the value is parsed, validated and converted to one of your programming language's actual number data types before being passed to your CSV writer, then your code is wrong.
cout 9 months ago

What would you suggest instead?

cherryteastain 9 months ago

Wish they returned std::expected<T, std::errc> instead of the weird from_chars_result struct

Longhanks 9 months ago

std::expected is new in C++23; std::from_chars was introduced in C++17. Obviously, 2023 features were not available in 2017. Changing the return type now would break everybody's code for very little benefit, since you can easily wrap std::from_chars.
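One way such a wrapper might look, sketched in C++17: `std::variant<T, std::errc>` is used here as a stand-in for `std::expected<T, std::errc>` so the example does not need C++23. `parse_number` is a hypothetical name, and treating leftover input as `invalid_argument` is a design choice of this sketch, not something from_chars does itself.

```cpp
#include <charconv>
#include <string_view>
#include <system_error>
#include <variant>

// Hypothetical wrapper: holds either the parsed value or an error code.
// std::variant stands in for std::expected, which requires C++23.
template <typename T>
std::variant<T, std::errc> parse_number(std::string_view s) {
    T value{};
    const char* last = s.data() + s.size();
    auto [ptr, ec] = std::from_chars(s.data(), last, value);
    if (ec != std::errc{}) return ec;                     // from_chars itself failed
    if (ptr != last) return std::errc::invalid_argument;  // sketch's choice: reject trailing input
    return value;
}
```

A caller can then use `std::holds_alternative<int>(result)` or `std::get_if` to branch on success, with the "consumed everything" check folded into the single return value.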
- jlarocco 9 months ago
  
  Having an inconsistent, special case way of doing something in the name of backwards compatibility is the defining characteristic of C++.
  - Night_Thastus 9 months ago
    
    It's a tradeoff.
    For example: with Python, breaking changes are more common, and everyone complains about how much they have to go and fix every time something changes.
    Damned if you do, damned if you don't. Either have good backwards compatibility but sloppy older parts of the language - or have a 'cleaner' overall language at the cost of crushing the community every time you want to change anything.
    
    jlarocco 9 months ago
    
    I'm fully aware, just pointing out that C++ is particularly afflicted with backward compatibility issues; far more than other languages.
    "The Design and Evolution of C++" gives the impression that even back in the 80s, major concessions were being made in the name of compatibility. At that time it was with C; now it's with previous versions of C++.
- criddell 9 months ago
  
  If returning std::expected makes more sense, why not make it the primary signature and create a wrapper to maintain compatibility with old code?
  - Thorrez 9 months ago
    
    Then you would have to use 2 names: the primary name and the wrapper name. What would they be? Using 2 names wastes more of the namespace, and will confuse people. If the wrapper name isn't from_chars, then people's code will break when upgrading.
    
    criddell 9 months ago
    
    Oh right. A different return type isn't enough to differentiate one function from another.
- cherryteastain 9 months ago
  
  They could add an overload like std::expected<T, std::errc> from_chars(std::string_view). That way, since the input arguments are different, there'd be no issues about overload resolution.
  - gpderetta 9 months ago
    
    But overloads providing slightly different interfaces are a bane to generic programming.
    Overloads are already confusing, if they can't be used generically there is really no point in reusing the name.
eMSF 9 months ago

from_chars_result closely matches the strtol line of functions we've had for decades. Not returning the end position would be weird here!

lynx23 9 months ago

Whoever introduced the rule to automatically delete :: in titles on a hacker site should be made to rethink their decisions. It's a silly rule. It should go.

pjmlp 9 months ago

Most of those silly rules can be overridden after submission: there is a time window in which you can edit the title to enforce the desired one.
MathMonkeyMan 9 months ago

It would be an interesting piece to learn why.
Maybe it was for ::reasons::.
- lynx23 9 months ago
  
  I am sure the intern who wrote a rule for the Apple VoiceOver speech synthesizer to special-case the number 18 so that it is pronounced in English while the synth voice is set to German imagined they had a good reason at the time as well. However, that doesn't make their decision less stupid. "Vierzehn Uhr", "Fünfzehn Uhr", "Sechzehn Uhr", "Siebzehn Uhr", "Eighteen Uhr".

criddell 9 months ago

The author lists sprintf as one of the ways you can convert a string to a number. How would that work?

murderfs 9 months ago

It's not a list of ways to convert strings to numbers, it's a list of string conversion functions (i.e. including the other direction). to_string is also listed there.
randomNumber7 9 months ago

Did you mean sscanf? You could, for example, parse a number into a double variable by using `sscanf(str, "%lf", &x)`.
You can even parse a whole line of a CSV file with multiple numbers in one call: `sscanf(str, "%d;%d;%d\n", &d1, &d2, &d3)`.
- criddell 9 months ago
  
  No, the article lists sprintf/snprintf.
  Another person already correctly pointed out that the author was listing a bunch of functions for number and string conversions in general.
  - randomNumber7 9 months ago
    
    printf makes no sense in this context, because the conversion happens in the other direction.
    The article doesn't change that. It also lists scanf...

userbinator 9 months ago

> Wasn’t the old stuff good enough? Why do we need new methods? In short: because from_chars is low-level, and offers the best possible performance.

That sounds like marketing BS, especially when most likely these functions just call into or are implemented nearly identically to the old C functions, which already "offer the best possible performance".

> I did some benchmarks, and the new routines are blazing fast![...]around 4.5x faster than stoi, 2.2x faster than atoi and almost 50x faster than istringstream

Are you sure that wasn't because the compiler decided to optimise away the function directly? I can believe it being faster than istringstream, since that has a ton of additional overhead.

After all, the source is here if you want to look into the horse's mouth:

https://raw.githubusercontent.com/gcc-mirror/gcc/master/libs...

Not surprisingly, under all those layers of abstraction-hell, there's just a regular accumulation loop.
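For readers unfamiliar with the term, an accumulation loop of this kind can be sketched as follows. This is a deliberately simplified illustration (unsigned, base 10 only) and is not the libstdc++ code; `parse_u32` is a hypothetical name, and unlike from_chars it simply gives up on overflow instead of reporting where the digits end.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>

// Simplified illustration only, not the library implementation.
// Returns the number of characters consumed, or 0 if there were no digits
// or the value does not fit in 32 bits.
std::size_t parse_u32(std::string_view s, std::uint32_t& out) {
    constexpr std::uint32_t max = std::numeric_limits<std::uint32_t>::max();
    std::uint32_t acc = 0;
    std::size_t i = 0;
    for (; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < '0' || c > '9') break;           // stop at the first non-digit
        const std::uint32_t digit = c - '0';
        if (acc > (max - digit) / 10) return 0;  // acc * 10 + digit would overflow
        acc = acc * 10 + digit;                  // the actual accumulation step
    }
    if (i == 0) return 0;                        // nothing that looks like a number
    out = acc;
    return i;
}
```

Everything besides the `acc = acc * 10 + digit` line is bookkeeping for termination and overflow, which is where real implementations differ.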

deeringc 9 months ago

You might want to watch this relevant video from Stephan T. Lavavej (the Microsoft STL maintainer): https://www.youtube.com/watch?v=4P_kbF0EbZM
- userbinator 9 months ago
  
  I don't need to listen to what someone says if I can look at the source myself.
  - deeringc 9 months ago
    
    I believe the impl you link to is not fully standards compliant, and has an approximate soln.
    MSFT's one is totally standards compliant and it is a very different beast: https://github.com/microsoft/STL/blob/main/stl/inc/charconv
    Apart from various nuts and bolts optimizations (eg not using locales, better cache friendliness, etc...) it also uses a novel algorithm which is an order of magnitude quicker for many floating-point tasks (https://github.com/ulfjack/ryu).
    If you actually want to learn about this, then watch the video I linked earlier.
  - secondcoming 9 months ago
    
    You profiled the code in your head?
ftrobro 9 months ago

Integers are simple to parse, but from_chars is a great improvement when parsing floats. It's more standardized across platforms than the old solutions (no need to worry about the locale, for example whether to use a comma or a dot as the decimal separator) and also has more reliable performance across compilers. The most advanced approaches to parsing floats can be surprisingly much faster than intermediately advanced approaches. The library used by GCC since version 12 (and also used by Chrome) claims to be 4 to 10 times faster than old strtod implementations:
https://github.com/fastfloat/fast_float
For more historical context:
https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...
flqn 9 months ago

They're locale independent, which the C strtol, strtof, etc. functions are not.
- a1369209993 9 months ago
  
  Yes, exactly. Which means that, while the speed gains are real, they only apply in cases where your libc is dangerously defective.
OvbiousError 9 months ago

I agree with some of this, and the author could've made a better case for from/to_chars:
- Afaik stoi and friends depend on the locale, so it's not hard to believe this introduces additional overhead. The implicit locale dependency is also often very surprising.
- std::stoi only accepts std::string as input, so you're forced to allocate a string to use it. std::from_chars does not.
- from/to_chars don't throw. As far as I know, exceptions don't affect performance when they aren't thrown, but it does mean you can use these functions in environments where exceptions are disabled.
- eptcyka 9 months ago
  
  Locale env stuff is inherently thread unsafe, which is the main reason to never rely on it.
- deeringc 9 months ago
  
  There's also the new Ryu algorithm that is being used, which is probably the biggest speed up.
  https://github.com/ulfjack/ryu
  - badmintonbaseba 9 months ago
    
    AFAIK the state of the art now is "dragonbox":
    https://github.com/jk-jeon/dragonbox
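To illustrate the std::string and exception points from the list above, here is a small side-by-side sketch. `stoi_throws` and `from_chars_status` are hypothetical helper names, and the out-of-range case assumes a 32-bit `int`.

```cpp
#include <charconv>
#include <stdexcept>
#include <string>
#include <string_view>
#include <system_error>

// std::stoi needs a std::string and signals failure by throwing.
// Returns true if it threw for this input.
bool stoi_throws(const std::string& s) {
    try {
        (void)std::stoi(s);
        return false;
    } catch (const std::invalid_argument&) {
        return true;   // no conversion could be performed
    } catch (const std::out_of_range&) {
        return true;   // value does not fit in int
    }
}

// std::from_chars works on any character range (here a string_view, so a
// slice of a larger buffer needs no copy) and reports failure as a code.
std::errc from_chars_status(std::string_view s, int& out) {
    return std::from_chars(s.data(), s.data() + s.size(), out).ec;
}
```

For "12345678901234", the string from the article's example, `stoi_throws` returns true while `from_chars_status` returns `std::errc::result_out_of_range` and leaves `out` untouched. Note that `std::stoi("12abc")` quietly returns 12, so the trailing-garbage issue discussed elsewhere in this thread is not unique to from_chars.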
majoe 9 months ago

A few months ago I optimized the parsing of a file and did some micro benchmarks. I observed a similar speed-up compared to stoi and atoi (didn't bother to look at stringstream). Others already commented that it's probably due to not supporting locales.
dexen 9 months ago

For the sake of example: a "locale-aware" number conversion routine would be the worst possible choice for parsing incoming network traffic. Beyond the performance concerns, there's the significant semantic difference in number formatting across cultures. Different conventions for the decimal or thousands comma can easily lead to subtle data errors or even security concerns.
Lastly, having simple and narrowly specified conversion routines allows one to create a small subset of the C++ standard library fit for constrained environments like embedded systems.
- doug_durham 9 months ago
  
  I get that. However, then they should name the function accordingly and put highly visible disclaimers in the documentation. Something like "from_ascii" instead of "from_chars". Also, the documentation, including this blog post, should be very clear that this function is only suitable for parsing machine-to-machine communications and should never be used for human input data. There is clearly a place for this type of function, but this blog post miscommunicates this in a potentially harmful way. When I read the post I presumed that this was a replacement for atoi() even though it had a confusing "non-locale" bullet point.

blux 9 months ago

Did you verify their claims or are you just calling BS and that's it? The new functions are in fact much faster than their C equivalent (and yes, I did verify that).

userbinator 9 months ago

Care to explain and show the details?

"Extraordinary claims require extraordinary evidence."

nmeofthestate 9 months ago

Your original claim "I've not checked but this guy, and by extension the C++ standards committee who worked on this new API, are probably full of shit" was pretty extraordinary.
- userbinator 9 months ago
  
  Look at the compiler-generated instructions yourself if you don't believe the source that I linked; in the cases I've seen all the extra new stuff just adds another layer on top of existing functions and if the former are faster the latter must necessarily also be.
  The standards committee's purpose is to justify their own existence by coming up with new stuff all the time. Of course they're going to try to spin it as better in some way.
j16sdiz 9 months ago

How not?
It compiles from sources, can be better inlined, and benefits from dead code elimination when you don't use an unusual radix. It also doesn't do locale-based things.

blux 9 months ago

I wrote this library once; https://github.com/ton/fast_int.

Removed `std::atoi` from the benchmarks since it was performing so poorly; not a contender. Should be easy to verify.

Rough results (last column is #iterations):

  Benchmark                                        Time             CPU   Iterations
  BM_fast_int<std::int64_t>/10                  1961 ns         1958 ns       355081
  BM_fast_int<std::int64_t>/100                 2973 ns         2969 ns       233953
  BM_fast_int<std::int64_t>/1000                3636 ns         3631 ns       186585
  BM_fast_int<std::int64_t>/10000               4314 ns         4309 ns       161831
  BM_fast_int<std::int64_t>/100000              5184 ns         5179 ns       136308
  BM_fast_int<std::int64_t>/1000000             5867 ns         5859 ns       119398
  BM_fast_int_swar<std::int64_t>/10             2235 ns         2232 ns       316949
  BM_fast_int_swar<std::int64_t>/100            3446 ns         3441 ns       206437
  BM_fast_int_swar<std::int64_t>/1000           3561 ns         3556 ns       197795
  BM_fast_int_swar<std::int64_t>/10000          3650 ns         3646 ns       188613
  BM_fast_int_swar<std::int64_t>/100000         4248 ns         4243 ns       165313
  BM_fast_int_swar<std::int64_t>/1000000        4979 ns         4973 ns       140722
  BM_atoi<std::int64_t>/10                     10248 ns        10234 ns        69021
  BM_atoi<std::int64_t>/100                    10996 ns        10985 ns        63810
  BM_atoi<std::int64_t>/1000                   12238 ns        12225 ns        56556
  BM_atoi<std::int64_t>/10000                  13606 ns        13589 ns        51645
  BM_atoi<std::int64_t>/100000                 14984 ns        14964 ns        47046
  BM_atoi<std::int64_t>/1000000                16226 ns        16206 ns        43279
  BM_from_chars<std::int64_t>/10                2162 ns         2160 ns       302880
  BM_from_chars<std::int64_t>/100               2410 ns         2407 ns       282778
  BM_from_chars<std::int64_t>/1000              3309 ns         3306 ns       208070
  BM_from_chars<std::int64_t>/10000             5034 ns         5028 ns       100000
  BM_from_chars<std::int64_t>/100000            6282 ns         6275 ns       107023
  BM_from_chars<std::int64_t>/1000000           7267 ns         7259 ns        96114
  BM_fast_float<std::int64_t>/10                2670 ns         2666 ns       262721
  BM_fast_float<std::int64_t>/100               3547 ns         3542 ns       196704
  BM_fast_float<std::int64_t>/1000              4643 ns         4638 ns       154391
  BM_fast_float<std::int64_t>/10000             5056 ns         5050 ns       132722
  BM_fast_float<std::int64_t>/100000            6207 ns         6200 ns       111565
  BM_fast_float<std::int64_t>/1000000           7113 ns         7105 ns        98847

adev_ 9 months ago

> Not surprisingly, under all those layers of abstraction-hell, there's just a regular accumulation loop.
Your dismissive answer sounds so much like that of a typical old-school C programmer who underestimates by two orders of magnitude what compiler inlining can do.
Abstraction, genericity and inlining on a function like from_chars is currently exactly what you want.
- userbinator 9 months ago
  
  It's my experience that says inlining only looks great in microbenchmarks but is absolutely horrible for cache usage and causes other things to become slower.
  - adev_ 9 months ago
    
    Which is wrong on almost any modern architecture.
    For small functions inlining is almost always preferable because (1) the prefetcher actually loves that and (2) a cache miss due to a mispredicted jump is way more costly than anything a bit of bloat will ever cost you.
j16sdiz 9 months ago

Enabling new static optimizations is good, no?
halayli 9 months ago

Your answer shows Dunning-Kruger in full effect.