Eugene

Reputation: 11085

Setting the Ceph endpoint to a DNS name doesn't work in Hadoop

I'm trying to set up a big data environment consisting of Hadoop (2.7), Spark (2.3) and Ceph (Luminous). Before I changed fs.s3a.endpoint to a domain name, everything worked as expected.

The relevant part of core-site.xml is shown below:

<property>
    <name>fs.defaultFS</name>
    <value>s3a://tpcds</value>
</property>
<property>
    <name>fs.s3a.endpoint</name>
    <value>http://10.1.2.213:8080</value>
</property>

However, I then changed fs.s3a.endpoint to a domain name:

<property>
    <name>fs.s3a.endpoint</name>
    <value>http://gw.gearon.com:8080</value>
</property>

When I then tried to launch Spark SQL on Hadoop YARN, the following error was thrown:

AmazonHttpClient:448 - Unable to execute HTTP request: tpcds.gw.gearon.com: Name or service not known
java.net.UnknownHostException: tpcds.gw.gearon.com: Name or service not known
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1277)

gw.gearon.com definitely resolves to 10.1.2.213. After some searching, I realized that one more property should be set:

<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
  <description>Enable S3 path style access ie disabling the default virtual hosting behaviour.
    Useful for S3A-compliant storage providers as it removes the need to set up DNS for virtual hosting.
  </description>
</property>

After setting fs.s3a.path.style.access to true, the error disappears when launching Hadoop MapReduce jobs. However, for Spark SQL on Hadoop YARN, the error persists. I thought Spark might be overriding Hadoop's settings, so I also added spark.hadoop.fs.s3a.path.style.access true to spark-defaults.conf, but it still doesn't work.
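For reference, Spark copies every spark.hadoop.* entry into the Hadoop configuration with the prefix stripped. The following is only a rough sketch of that mapping, useful for checking which Hadoop properties a given spark-defaults.conf would actually override. The function name hadoop_overrides and the sample text are hypothetical, and the parser assumes simple whitespace-separated "key value" lines (it does not handle "key=value" or other forms the real loader may accept).

```python
def hadoop_overrides(conf_text):
    """Return the Hadoop properties implied by spark.hadoop.* entries.

    Assumption: each setting is a 'key value' pair separated by whitespace,
    with '#' starting a comment line. This is a simplification.
    """
    prefix = "spark.hadoop."
    overrides = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # a key without a value is ignored in this sketch
        key, value = parts
        if key.startswith(prefix):
            # Strip the prefix to get the Hadoop-side property name.
            overrides[key[len(prefix):]] = value.strip()
    return overrides


# Hypothetical file contents, mirroring the setting described above.
SAMPLE = """\
# Spark defaults
spark.master                              yarn
spark.hadoop.fs.s3a.path.style.access     true
"""

if __name__ == "__main__":
    print(hadoop_overrides(SAMPLE))
```

If the expected property is missing from the result, the setting is probably misspelled or lives in a file Spark does not read.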

So here is the question: the endpoint I set is http://gw.gearon.com:8080, so why does the error say that tpcds.gw.gearon.com is unknown? tpcds is my Ceph bucket name, which I set as fs.defaultFS, and it looks fine in core-site.xml. How can I solve this issue?

Any comments are welcome, and thanks in advance for your help.

Upvotes: 0

Views: 377

Answers (1)

dodger

Reputation: 101

You should use "amazon naming methods", as described here and here.

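That is, point a wildcard DNS CNAME record at the name of the gateway(s). A CNAME target has to be a host name rather than an IP address, so the record refers to gw.gearon.com, which in turn resolves to 10.1.2.213:

*.gw.gearon.com CNAME gw.gearon.com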

Also be sure to configure that name on the gateways themselves (documentation here). The value below is an example, so substitute your own gateway name:

rgw dns name = clover.voxelgroup.net
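To illustrate why the bucket name ends up in the host name, here is a minimal sketch of the two S3 addressing styles. It is a simplified model, not the actual AWS SDK code: the function names are hypothetical, and the rule that an IP-literal endpoint falls back to path style is an assumption about the client's behavior that would explain why the original IP-based endpoint worked.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_literal(host):
    """Return True if host is an IPv4/IPv6 literal rather than a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def request_host(endpoint, bucket, path_style=False):
    """Host name a client would have to resolve for a given bucket.

    Virtual-hosted style prepends the bucket to the endpoint host
    (bucket.host); path style keeps the host and puts the bucket in the
    URL path instead. Assumption: IP-literal endpoints always use path style.
    """
    host = urlsplit(endpoint).hostname
    if path_style or is_ip_literal(host):
        return host
    return f"{bucket}.{host}"


if __name__ == "__main__":
    print(request_host("http://10.1.2.213:8080", "tpcds"))
    print(request_host("http://gw.gearon.com:8080", "tpcds"))
    print(request_host("http://gw.gearon.com:8080", "tpcds", path_style=True))
```

Under this model, the second call yields tpcds.gw.gearon.com, which is exactly the name the wildcard record has to cover. With path-style access enabled, only gw.gearon.com needs to resolve.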

Upvotes: 1
