Student

Reputation: 1197

What does the error below mean in PySpark?

I am following this tutorial (http://spark.apache.org/docs/latest/quick-start.html), so far to no avail.

I have tried the following:

textFile=sc.textFile("README.md")
textFile.count()

Below is the output I receive instead of the expected result of 126.

> textFile=sc.textFile("README.md")
15/11/18 13:19:49 INFO MemoryStore: ensureFreeSpace(182712) called with curMem=254076, maxMem=556038881
15/11/18 13:19:49 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 178.4 KB, free 529.9 MB)
15/11/18 13:19:49 INFO MemoryStore: ensureFreeSpace(17179) called with curMem=436788, maxMem=556038881
15/11/18 13:19:49 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 16.8 KB, free 529.8 MB)
15/11/18 13:19:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:61916 (size: 16.8 KB, free: 530.2 MB)
15/11/18 13:19:49 INFO SparkContext: Created broadcast 2 from textFile at null:-2

> textFile.count()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 1006, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 997, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 871, in fold
    vals = self.mapPartitions(func).collect()
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 773, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\sql\utils.py", line 36, in deco
    return f(*a, **kw)
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/Administrator/Downloads/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4/bin/README.md
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Unknown Source)

Upvotes: 2

Views: 12160

Answers (1)

desertnaut

Reputation: 60319

As @santon says, your input path does not exist: the file README.md lives in the Spark home directory, not in $SPARK_HOME/bin, which is where your error shows Spark looking for it (presumably because that is where you started pyspark from). Here is the situation on Ubuntu:

 ~$ echo $SPARK_HOME
 /usr/local/bin/spark-1.5.1-bin-hadoop2.6
 ~$ cd $SPARK_HOME
 /usr/local/bin/spark-1.5.1-bin-hadoop2.6$ ls
 bin          conf  ec2       lib      NOTICE  R          RELEASE
 CHANGES.txt  data  examples  LICENSE  python  README.md  sbin

So, since README.md is not in your working directory, you should either provide the full path or make sure the file is present in your current working directory, i.e. the directory from which you started pyspark:

 /usr/local/bin/spark-1.5.1-bin-hadoop2.6$ ./bin/pyspark
 [...]
 >>> import os
 >>> os.getcwd()
 '/usr/local/bin/spark-1.5.1-bin-hadoop2.6'
 >>> os.listdir(os.getcwd())
 ['lib', 'LICENSE', 'python', 'NOTICE', 'examples', 'ec2', 'README.md', 'conf', 'CHANGES.txt', 'R', 'data', 'RELEASE', 'bin', 'sbin']

Now, your code will work, since README.md is in your working directory:

 >>> textFile=sc.textFile("README.md")
 [...]
 >>> textFile.count()
 [...]
 98

BTW, the correct answer here is 98 (cross-checked); I am not sure why the tutorial says 126.

Summarizing: use os.listdir(os.getcwd()) to confirm that the file you are looking for exists in your current working directory. If it does, your code above works unmodified; if not, either provide the full file path (see the sketch below) or change your working directory using the appropriate Python commands.
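Here is a minimal sketch of the full-path option, assuming the SPARK_HOME environment variable points at your Spark installation root; if it is not set (e.g. on a default Windows setup), just hard-code the directory that actually contains README.md, such as the one shown in the listing above:

 import os

 # Build an absolute path to README.md under the Spark installation root.
 # SPARK_HOME is assumed to be set; otherwise hard-code the directory.
 spark_home = os.environ["SPARK_HOME"]
 readme_path = os.path.join(spark_home, "README.md")

 # sc is the SparkContext already created by the pyspark shell
 textFile = sc.textFile(readme_path)
 textFile.count()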

Upvotes: 3
