sthames42

Reputation: 1019

Erlang/OTP node connectivity issues running under WSL2 and using "longnames" for the `name_domain`

Trying to RPC to another node from a script, everything works when using "shortnames" but fails when using "longnames".

Where my local machine name is "Pandora", and after starting a detached node,

erl -detached -noshell -name 'node1@Pandora' -setcookie pandemic

Running this script with name_domain => shortnames,

#!/usr/bin/env escript
-mode(compile).

-define(THIS_NODE, 'testnode@Pandora').
-define(THAT_NODE, 'node1@Pandora').

show(R) -> io:format("~p~n", [R]).

main(_) ->
  net_kernel:start(?THIS_NODE, #{name_domain => shortnames}),
  erlang:set_cookie(?THAT_NODE, pandemic),

  show( erl_epmd:names() ),  
  show( net_adm:names() ),

  show( net_adm:ping(?THAT_NODE) ),

  show( rpc:call(?THAT_NODE, erlang, time, []) ).

works correctly and produces this:

{ok,[{"node1",40965},{"testnode",35319}]}
{ok,[{"node1",40965},{"testnode",35319}]}
pong
{14,1,21}

However, when I change it to name_domain => longnames (to simulate working in a distributed environment):

  net_kernel:start(?THIS_NODE, #{name_domain => longnames}),

The test fails with an error report:

=ERROR REPORT==== 6-Oct-2024::14:09:07.875238 ===
** System running to use fully qualified hostnames **
** Hostname Pandora is illegal **

Clearly, "Pandora" is not an FQDN so I attempted to solve this by creating a local Inets Configuration file called erl_inetrc and setting the local domain to that of my office router:

{domain, "myoffice.loc"}.

And did so in my test script, as well:

-define(THIS_NODE, 'testnode@Pandora.myoffice.loc').
-define(THAT_NODE, 'node1@Pandora.myoffice.loc').

I then set the location of the file in the ERL_INETRC environment variable:

export ERL_INETRC="$(pwd)/erl_inetrc"

Sadly, this resulted in my test script freezing up at the net_adm:names() command. erl_epmd:names() worked, however, which is odd given net_adm:names is supposed to call erl_epmd:names.

Anybody have any idea why net_adm:names() freezes up?

Upvotes: 0

Views: 60

Answers (1)

sthames42

Reputation: 1019

Updated Answer

Turns out there is a much better answer:

  • sudo vi /etc/wsl.conf and add the fully-qualified hostname to the [network] section:

    [network]
    hostname="pandora.wsl"
    
  • You will need to restart WSL2 for this to take effect. Best to restart your computer but this will work, too (see step#2).

  • The local network configuration file required in the original answer is no longer necessary.

  • Do not call erl_epmd:names() as it will now freeze up. Use net_adm:names() instead (see Notes, below).

  • To confirm the change with my test script, this line must be removed or commented out to prevent it freezing up:

    show( erl_epmd:names() ),
    

Notes

This gets us nearly all the way to what we'd expect from running Erlang/OTP on a native Linux or Windows OS, but it doesn't quite fix everything. I'm guessing the remaining issues are an implementation side effect of WSL, which resolves the hostname to the localhost IP address 127.0.0.1.

  • Since we are making hostname fully-qualified, instead of setting domainname separately, the hostname will no longer resolve without the domain name (Some have tried setting domainname to no effect).

  • In my question, I noted net_adm:names() froze after assigning the domain in a local config file. Yet, erl_epmd:names() still worked.

    After removing the local config file and making the hostname fully-qualified, erl_epmd:names() freezes while net_adm:names() works correctly.

  • In my original answer, I noted using erl_call to terminate a node:
    erl_call -name 'node1@Pandora.wsl' -c pandemic -q
    failed with an error:
    erl_call: can't ei_gethostbyname(Pandora.wsl)
    but the command did work using the short name:
    erl_call -sname 'node1' -c pandemic -q.

    After the change, the -name form with the fully-qualified node name works without error, and using -sname node1 freezes up.

What the heck?

erl_epmd:names() retrieves the host name from inet:gethostname(), which strips the domain from the fully-qualified WSL hostname. This worked fine with the default hostname but now freezes because that name no longer resolves.

net_adm:names() retrieves the fully-qualified hostname from net_adm:localhost(), which appends the domain name to the host name returned from inet:gethostname(). That worked, before I made any changes, because there was no domain name to append. After I added the local config file, WSL didn't resolve Pandora.wsl. But, now it does.

It's worth noting erl_epmd:names/0 is not included in the API docs and is probably not intended for use by the public. Calling erl_epmd:names/1 with the WSL fully-qualified hostname works fine.
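To see the difference between the two lookups on a given machine, a small escript along these lines can print both host names and query epmd explicitly. This is only a sketch: the file name is made up, it assumes epmd is already running, and each inet:getaddr/2 call may block for a while if the name does not resolve.

```erlang
#!/usr/bin/env escript
%% hostnames.escript (hypothetical name): compare the two host name lookups.

main(_) ->
    %% Host name as reported by inet:gethostname/0.
    {ok, Short} = inet:gethostname(),
    %% Host name as reported by net_adm:localhost/0.
    Long = net_adm:localhost(),
    io:format("inet:gethostname():  ~s~n", [Short]),
    io:format("net_adm:localhost(): ~s~n", [Long]),
    %% Show how each name resolves: {ok, Address} or {error, Reason}.
    lists:foreach(
      fun(Host) ->
              io:format("~s -> ~p~n", [Host, inet:getaddr(Host, inet)])
      end,
      [Short, Long]),
    %% Ask epmd for its registered names via the explicit-host form.
    io:format("~p~n", [erl_epmd:names(Long)]).
```

If the two printed names differ and only one of them resolves, that would match the behaviour described above.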

Conclusion

This is certainly the better solution but it would be best for WSL to distinguish the host name separately from the domain name, as any other Linux distribution would. If anyone has figured out a way to make that happen, reliably, please share.


Original Answer

The solution turned out to be exceedingly simple but here's how I came to it:

  • Hostnames must resolve to an IP address. The documentation is not entirely clear on this but reading the net_adm:names() code confirmed it.

  • WSL2 runs in a virtual machine with NAT networking where the IP address assigned to the host is not the one assigned by the local router in the local domain.

  • When "shortnames" are used, everything runs on localhost which will always resolve correctly. For "longnames", hostnames must be fully-qualified with a domain and the FQDN must resolve to an IP address.
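The two "longnames" conditions from the list above (a qualified host part, and a host that resolves) can be checked for any node name with a short escript. This is a rough sketch: the script name and helper functions are invented for illustration, and the "contains a dot" test is only an approximation of the check behind the "Hostname ... is illegal" report.

```erlang
#!/usr/bin/env escript
%% check_node.escript (hypothetical name): inspect the host part of a node name.

%% "node1@Pandora.wsl" -> "Pandora.wsl"
host_part(Node) ->
    [_Name, Host] = string:split(Node, "@"),
    Host.

%% Assumption: a host part counts as qualified when it contains a dot;
%% "Pandora" alone has none.
is_qualified(Host) ->
    lists:member($., Host).

main([Node]) ->
    Host = host_part(Node),
    io:format("host part: ~s~n", [Host]),
    io:format("qualified: ~p~n", [is_qualified(Host)]),
    %% inet:getaddr/2 returns {ok, Address} or {error, Reason}.
    case inet:getaddr(Host, inet) of
        {ok, Addr}      -> io:format("resolves to: ~s~n", [inet:ntoa(Addr)]);
        {error, Reason} -> io:format("does not resolve: ~p~n", [Reason])
    end;
main(_) ->
    io:format("usage: check_node.escript name@host~n").
```

It would be run as, for example, escript check_node.escript node1@Pandora.wsl. Checking the resolved address against the WSL2 network shows whether the name points at an address the local epmd can actually be reached on.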

My office network address for myoffice.loc is 10.1.1.0/24. My WSL2 network is 172.16.32.0/24. When net_adm:names() resolved pandora.myoffice.loc to 10.1.1.119, it could not bind to the port used by the epmd daemon. But it failed to report any error and simply froze up.

It turns out setting the local domain to one that won't be found in DNS makes all hosts resolve to the localhost IP of 127.0.0.1. I have no idea why but it fixed my problem.

So I changed my local config file erl_inetrc to use a non-existent domain,

{domain, "wsl"}.

Changed the node domains in my test script,

-define(THIS_NODE, 'testnode@Pandora.wsl').
-define(THAT_NODE, 'node1@Pandora.wsl').

Started my detached node with the FQDN,

erl -detached -noshell -name 'node1@Pandora.wsl' -setcookie pandemic

And everything works as it should:

{ok,[{"node1",37267},{"testnode",39771}]}
{ok,[{"node1",37267},{"testnode",39771}]}
pong
{20,1,55}
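When experimenting with erl_inetrc it can help to confirm that the file named by ERL_INETRC was actually picked up. A minimal sketch, assuming it is run from the same shell where the variable was exported (the script name is hypothetical, and whether a domain entry appears in the output may depend on the OTP release):

```erlang
#!/usr/bin/env escript
%% show_inetrc.escript (hypothetical name): print the effective inet config.

main(_) ->
    %% os:getenv/1 returns false when the variable is not set.
    io:format("ERL_INETRC = ~p~n", [os:getenv("ERL_INETRC")]),
    %% inet:get_rc/0 lists the recorded inet configuration parameters.
    Rc = inet:get_rc(),
    %% lists:keyfind/3 returns false when there is no domain entry.
    io:format("domain entry: ~p~n", [lists:keyfind(domain, 1, Rc)]),
    io:format("full inet config:~n~p~n", [Rc]).
```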

Additional Notes

  • My detached test node is normally terminated from the command line with erl_call.
    In this case: erl_call -name 'node1@Pandora.wsl' -c pandemic -q.

    But, when I ran this command with my dummy domain name, it produced an error:
    erl_call: can't ei_gethostbyname(Pandora.wsl).

    However, the same command will work by using the short name (go figure):
    erl_call -sname 'node1' -c pandemic -q

  • I wasn't able to make WSL2 return the IP it assigns to the host for [email protected]. I found a way around it but it will obviously only work for Erlang/OTP. While I didn't use any of them, here are a couple of networking solutions that show promise.

Upvotes: 0
