user2108150

Why doesn't g++ generate "raw" symbols?

From C we know what legal variable names are. The general regex for the legal names looks similar to [A-Za-z_][A-Za-z0-9_]*.

Using dlsym we can load arbitrary strings, and C++ mangles names that include @ in the ABI.

My question is: can arbitrary strings be used? The documentation on dlsym does not seem to mention anything.

Another question that came up appears to imply that it is entirely possible to have arbitrary null-terminated symbols. This leads me to ask the following question:

Why doesn't g++ emit raw function signatures, with name and parameter list, including namespace and class membership?

Here's what I mean:

namespace test {
class A
{
    int myFunction(const int a);
};
}

namespace test {
int A::myFunction(const int a){return a * 2;}
}

This does not get compiled to

int ::test::A::myFunction(const int a)\0

Instead, on my 64-bit machine, using g++ 4.9.2, it gets compiled to

0000000000000000 T _ZN4test1A10myFunctionEi

This output was produced by nm. The code was compiled using g++ -c test.cpp -o out

Upvotes: 7

Views: 272

Answers (5)

Lightness Races in Orbit

Reputation: 385144

(In this answer I ignore that you made several typos in your example of ::test::A::void myFunction(const int a)).

This format is:

  • not programmer-specific; consider that all these are the same, so why confuse people:
    • int ::test::A::myFunction(const int)
    • int ::test::A::myFunction(int const)
    • int test::A::myFunction(int const)
    • int test :: A :: myFunction (int const)
    • and so on…
  • unambiguous
  • terse; no parameter names or other unnecessary decorations
  • easier to parse (notice that the length of each component is present as a number)
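To illustrate the length-prefix point, here is a minimal sketch of a splitter for the simple _ZN...E shape seen in the question. The function name split_nested_name and the demo helper are made up for this example. It deliberately ignores everything else the mangling scheme allows (templates, substitutions, operators, qualifiers), so it is not a real demangler.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a symbol of the shape
//   _ZN <len><name> <len><name> ... E <parameter encoding>
// into its name components. Returns an empty vector for any other shape.
std::vector<std::string> split_nested_name(const std::string& symbol) {
    const std::string prefix = "_ZN";
    if (symbol.compare(0, prefix.size(), prefix) != 0) return {};

    std::vector<std::string> parts;
    std::size_t pos = prefix.size();
    while (pos < symbol.size() && symbol[pos] != 'E') {
        // Each component starts with its length written in decimal.
        if (!std::isdigit(static_cast<unsigned char>(symbol[pos]))) return {};
        std::size_t len = 0;
        while (pos < symbol.size() &&
               std::isdigit(static_cast<unsigned char>(symbol[pos]))) {
            len = len * 10 + static_cast<std::size_t>(symbol[pos] - '0');
            ++pos;
        }
        // The length says exactly how many characters to take next.
        if (len == 0 || len > symbol.size() - pos) return {};
        parts.push_back(symbol.substr(pos, len));
        pos += len;
    }
    if (pos >= symbol.size()) return {};  // no terminating 'E' found
    return parts;
}

// Usage example on the symbol from the question.
void demo() {
    const std::vector<std::string> parts =
        split_nested_name("_ZN4test1A10myFunctionEi");
    assert(parts.size() == 3);
    assert(parts[0] == "test");
    assert(parts[1] == "A");
    assert(parts[2] == "myFunction");
}
```

No lookahead or backtracking is needed: once a length has been read, the parser knows exactly where the component ends, which a whitespace-tolerant textual signature would not offer.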

Meanwhile, I see no benefit at all in choosing a human-readable looks-like-C++ format for a C++ ABI. This stuff is supposed to be optimised for machines. Why would you make it less optimal for machines, in order to make it more optimal for humans? And probably failing at the latter whilst doing so.

You say that your compiler does not emit "raw symbols". I posit that it does precisely that.

Upvotes: 1

DevSolar

Reputation: 70263

You basically answered your own question:

The general regex for the legal names looks similar to [A-Za-z_][A-Za-z0-9_]*.

From the beginning, C++ used preexisting (C) linker / loader technology. There is nothing "C++" about either ld, ld-linux.so etc.

So linking is limited to what was legal in C already. That does not include colons, parentheses, ampersands, asterisks, and whatever else you would need to encode C++ identifiers in plain text.
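As a rough illustration of the character-set argument, the sketch below checks whether a string stays within what a C identifier may contain. The function name is_c_identifier and the demo helper are made up for this example, and the check assumes the default "C" locale. The mangled name from the question passes; the plain-text signature does not.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: true if `name` starts with a letter or underscore
// and continues with letters, digits, or underscores only.
bool is_c_identifier(const std::string& name) {
    if (name.empty()) return false;
    const unsigned char first = static_cast<unsigned char>(name[0]);
    if (!(std::isalpha(first) || name[0] == '_')) return false;
    for (char c : name) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (!(std::isalnum(uc) || c == '_')) return false;
    }
    return true;
}

// Usage example with the two spellings from the question.
void demo() {
    assert(is_c_identifier("_ZN4test1A10myFunctionEi"));
    assert(!is_c_identifier("int ::test::A::myFunction(const int a)"));
}
```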

Upvotes: 1

ivan_pozdeev

Reputation: 35998

  1. Because of limitations on the exported names imposed by a linker (and that includes the OS's dynamic linker) - character set, length. The very phenomenon of mangling arose because of this.
    • Corollary: in environments where these limitations don't exist (various VMs that use their own linkers, e.g. .NET, Java), mangling doesn't exist, either.
  2. Each compiler that produces exports that are incompatible with others must use a different scheme, because the linker (static or dynamic) doesn't care about ABIs; all it cares about is identifiers.

Upvotes: 1

Mark B

Reputation: 96241

I'm sure this decision was made pragmatically to avoid having to make any changes to pre-existing C linkers (it quite possibly even originated with cfront). By emitting symbols with the same set of characters the C linker is used to, you avoid having to make any number of updates and can use the linker off the shelf.

Additionally C and C++ are widely portable languages and they wouldn't want to risk breaking a more obscure binary format (perhaps on an embedded system) by including unexpected symbols.

Finally, since you can always demangle (with something like c++filt, for example), it probably didn't seem worth using a full text representation.
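Demangling is also available from inside a program. The sketch below assumes GCC or Clang on a platform using the Itanium C++ ABI, where <cxxabi.h> provides abi::__cxa_demangle. The wrapper name demangle and the demo helper are made up for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical wrapper: returns the demangled form of `mangled`,
// or the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with
    // malloc; the caller must release it with free.
    char* raw = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || raw == nullptr) return mangled;
    std::string result(raw);
    std::free(raw);
    return result;
}

// Usage example on the symbol from the question.
void demo() {
    assert(demangle("_ZN4test1A10myFunctionEi") == "test::A::myFunction(int)");
}
```

Note that the demangled text contains neither the parameter name nor the top-level const, because neither is encoded in the symbol.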

P.S. You would absolutely not want to include the parameter name in the function name: People will not be happy if renaming a parameter breaks ABI. It's hard enough to keep ABI compatibility already.
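A small compile-time sketch of why parameter names have no place in the symbol: they are not part of the function's type, and neither is a top-level const on a parameter. The names f and Widget are made up for this example, and myFunction is declared public here (unlike in the question) so that its address can be taken.

```cpp
#include <type_traits>

// Two declarations of the same function: the parameter name and the
// top-level const differ, yet this is a redeclaration, not an overload.
int f(const int a);
int f(int renamed);

static_assert(std::is_same<decltype(f), int(int)>::value,
              "parameter name and top-level const are not part of the type");
static_assert(std::is_same<int(const int), int(int)>::value,
              "top-level const on a parameter is dropped from the type");

// The same holds for a member function like the one in the question.
struct Widget {
    int myFunction(const int a);
};
static_assert(std::is_same<decltype(&Widget::myFunction),
                           int (Widget::*)(int)>::value,
              "member function type ignores the parameter name as well");
```

If the file compiles, all three assertions hold, so renaming a parameter cannot change the function's type or the name mangled from it.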

Upvotes: 5

5gon12eder

Reputation: 25409

GCC is compliant with the Itanium C++ ABI. If your question is “Why does the Itanium C++ ABI require names to be mangled that way?” then the answer is probably

  1. because its designers thought this would be a good idea and
  2. shorter symbols make for smaller object files and faster dynamic linking.

For the second point, there is a pretty good explanation in Ulrich Drepper's article How To Write Shared Libraries.

Upvotes: 1
