unique_beast

Reputation: 1480

Error Loading mllib sample data into PySpark

I am trying to load some of the sample data into PySpark for the Spark 1.3.0 MLlib RandomForest example, and I am getting the errors below. I am new to MLlib and am unsure how to investigate this error further.

Code: https://spark.apache.org/docs/1.3.0/mllib-ensembles.html

Error:

data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
15/10/28 15:46:27 INFO storage.MemoryStore: ensureFreeSpace(100612) called with curMem=213451, maxMem=278302556
15/10/28 15:46:27 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 98.3 KB, free 265.1 MB)
15/10/28 15:46:28 INFO storage.MemoryStore: ensureFreeSpace(22935) called with curMem=314063, maxMem=278302556
15/10/28 15:46:28 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 22.4 KB, free 265.1 MB)
15/10/28 15:46:28 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:43188 (size: 22.4 KB, free: 265.4 MB)
15/10/28 15:46:28 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/10/28 15:46:28 INFO spark.SparkContext: Created broadcast 1 from textFile at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/mllib/util.py", line 120, in loadLibSVMFile
    numFeatures = parsed.map(lambda x: -1 if x[1].size == 0 else x[1][-1]).reduce(max) + 1
  File "/usr/lib/spark/python/pyspark/rdd.py", line 740, in reduce
    vals = self.mapPartitions(func).collect()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 701, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o49.collect.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameservice1/user/aowens/data/mllib/sample_libsvm_data.txt
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1511)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
    at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:312)
    at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Upvotes: 1

Views: 786

Answers (1)

eliasah

Reputation: 40380

According to your error log, the input path you provided, i.e. hdfs://nameservice1/user/aowens/data/mllib/sample_libsvm_data.txt, does not exist.

You need to make sure the path exists. Note that since your cluster's default filesystem is HDFS, the relative path data/mllib/sample_libsvm_data.txt is resolved against your HDFS home directory (/user/aowens), not the local filesystem, so either upload the file to that HDFS location or point Spark at the local copy with an explicit file:// URI.
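A minimal sketch of the second option, assuming the sample file ships with your Spark installation under /usr/lib/spark (the install prefix shown in your traceback; adjust the path for your setup):

```python
import os

# Hypothetical location of the sample file on the local filesystem;
# Spark distributions typically ship it under the install directory.
local_path = "/usr/lib/spark/data/mllib/sample_libsvm_data.txt"

# A bare relative path like "data/mllib/..." is resolved against the
# default filesystem (here HDFS, under /user/aowens). Prefixing the
# absolute local path with file:// forces Spark to read from local disk.
uri = "file://" + local_path

# Optionally verify the file exists on the driver before loading:
if os.path.exists(local_path):
    print("found local copy, loading " + uri)
    # data = MLUtils.loadLibSVMFile(sc, uri)
```

Alternatively, copy the file into HDFS first (e.g. `hdfs dfs -put /usr/lib/spark/data/mllib/sample_libsvm_data.txt data/mllib/`) and keep the original relative path.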

Upvotes: 3
