Hussain Bohra

Reputation: 1005

Pyspark Dataframe: Unable to Save As Hive Table

I am trying to use `saveAsTable` on a dataframe - our Hive metastore is in an external RDS and I am trying to store the data in S3 - but it fails with the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o87.saveAsTable. : java.net.NoRouteToHostException: No Route to Host from ip-10-174-20-142/10.174.20.142 to ip-10-174-26-239.ec2.internal:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost

Here is the complete code and error:

[hbohra@ip-10-174-20-142 ~]$ pyspark --files /etc/hive/conf/hive-site.xml 
Python 2.7.12 (default, Sep  1 2016, 22:14:00) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/01/30 21:05:39 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/

Using Python version 2.7.12 (default, Sep  1 2016 22:14:00)
SparkSession available as 'spark'.
>>> from pyspark.sql.functions import lit
>>> df = sqlContext.read.json('s3://dl-rawdata-dev/reporting/impressions/platform/module/context/2017-01-27T00-00')
>>> asset_df = df.withColumn('cust_id', df['key']['cust_id']).withColumn('platform', lit('platform')).withColumn('context', lit('context')).withColumn('module', lit('context')).withColumn('impressions',df['metric']['impressions']).withColumn('orders', df['metric']['orders']).withColumn('subtotal', df['metric']['subtotal']).withColumn('adfee', df['metric']['adfee']).withColumn('clicks', df['metric']['clicks']).withColumn('views', df['metric']['views']).withColumn('report_date', lit('2017-01-27')).drop('key').drop('metric')
>>> asset_df.write.format('orc').partitionBy('report_date', 'platform').saveAsTable('hbohra.reporting', path='s3://dl-data-assets-dev/hbohra.db/reporting/', mode='overwrite')
17/01/30 21:06:07 WARN command.CreateDataSourceTableUtils: Persisting partitioned data source relation `hbohra`.`reporting` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): 
s3://dl-data-assets-dev/hbohra.db/reporting
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 585, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o87.saveAsTable.
: java.net.NoRouteToHostException: No Route to Host from  ip-10-174-20-142/10.174.20.142 to ip-10-174-26-239.ec2.internal:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:758)
    at org.apache.hadoop.ipc.Client.call(Client.java:1479)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy12.delete(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:540)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy13.delete(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:2044)
    at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:707)
    at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:703)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:703)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:185)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:152)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:152)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72)
    at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:152)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:226)
    at org.apache.spark.sql.execution.command.CreateDataSourceTableUtils$.createDataSourceTable(createDataSourceTables.scala:504)
    at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:259)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:378)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:354)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
    at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
    at org.apache.hadoop.ipc.Client.call(Client.java:1451)
    ... 48 more

Upvotes: 0

Views: 8756

Answers (2)

Tony

Reputation: 1254

I think the issue here is that the table in the Hive metastore has been created by a different cluster.

We had the same issue, and the host mentioned in the stack trace was nowhere to be found in our EC2 dashboard.

This link explains that managed (not external) tables keep a reference to the filesystem URI, which contains a hostname that no longer exists.
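
If that's the case here, one way to confirm and work around it is a sketch like the following (hedged: the table name and S3 path are the ones from the question, and `DESCRIBE FORMATTED` / `ALTER TABLE ... SET LOCATION` are standard Spark SQL / Hive statements):

    # Ask the metastore where it thinks the managed table lives. If the
    # diagnosis above is right, the Location row will show an hdfs:// URI
    # pointing at the old cluster's NameNode rather than the S3 path.
    spark.sql('DESCRIBE FORMATTED hbohra.reporting').show(100, truncate=False)

    # If the data really lives on S3, repoint the table at it (or simply
    # DROP TABLE and recreate it as an external table), so Spark stops
    # trying to delete the dead cluster's HDFS path on overwrite.
    spark.sql(
        "ALTER TABLE hbohra.reporting "
        "SET LOCATION 's3://dl-data-assets-dev/hbohra.db/reporting/'"
    )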

Upvotes: 4

stevel

Reputation: 13480

Did you follow the wiki entry? Did it help?

Update for anyone voting this down:

The Hadoop network stack error messages were reworked a few years back so that every low-level network failure is caught and wrapped to add the bit Sun left out: source and destination hostnames and ports. At the same time, we put in a wiki link, so that people who aren't familiar with things like NoRouteToHostException have a nice list of common causes. People like myself add new ones when we find new ways to get the stack.

As a co-author of that patch, I really despair when people file bug reports etc. about something not working without actually following the wiki link. They are skipping a core piece of the self-help diagnostics and troubleshooting we've added to the OSS project.

As the wiki entries note, "your network, your problem". You are going to have to work out what's up; the (hostname, port) pairs are a good cue for working out which services aren't working, so it's time to use the low-level tools like telnet and ping.
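
For example, a quick first check from the driver host (a minimal Python sketch; the hostname and port are taken from the stack trace above) is to try opening a TCP connection to the NameNode address that's failing:

    # Mirror what the Hadoop IPC client is doing: open a TCP socket to the
    # NameNode address from the stack trace. A 'no route to host' error here
    # confirms the problem is at the network/host level, not Spark or Hive.
    import socket

    host, port = 'ip-10-174-26-239.ec2.internal', 8020
    try:
        socket.create_connection((host, port), timeout=5).close()
        print('%s:%d is reachable' % (host, port))
    except socket.error as exc:  # EHOSTUNREACH is what NoRouteToHostException wraps
        print('%s:%d is unreachable: %s' % (host, port, exc))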

To close, then: the exception includes diagnostic info and a link to a Hadoop wiki entry on the failure. Follow the link.

Upvotes: 2
