Reputation: 1197
I am following this tutorial, http://spark.apache.org/docs/latest/quick-start.html, to no avail.
I have tried the following:
textFile=sc.textFile("README.md")
textFile.count()
Below is the output that I receive instead of the desired result, 126.
> textFile=sc.textFile("README.md")
15/11/18 13:19:49 INFO MemoryStore: ensureFreeSpace(182712) called with curMem=254076, maxMem=556038881
15/11/18 13:19:49 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 178.4 KB, free 529.9 MB)
15/11/18 13:19:49 INFO MemoryStore: ensureFreeSpace(17179) called with curMem=436788, maxMem=556038881
15/11/18 13:19:49 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 16.8 KB, free 529.8 MB)
15/11/18 13:19:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:61916 (size: 16.8 KB, free: 530.2 MB)
15/11/18 13:19:49 INFO SparkContext: Created broadcast 2 from textFile at null:-2
> textFile.count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 1006, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 997, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 871, in fold
    vals = self.mapPartitions(func).collect()
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\rdd.py", line 773, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\pyspark\sql\utils.py", line 36, in deco
    return f(*a, **kw)
  File "C:\Users\Administrator\Downloads\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\spark-1.5.2-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/Administrator/Downloads/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4/spark-1.5.2-bin-hadoop2.4/bin/README.md
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Unknown Source)
Upvotes: 2
Views: 12160
Reputation: 60319
As @santon says, your input path does not exist: README.md lives in the Spark home directory, not under $SPARK_HOME/bin. Here is the situation on Ubuntu:
~$ echo $SPARK_HOME
/usr/local/bin/spark-1.5.1-bin-hadoop2.6
~$ cd $SPARK_HOME
/usr/local/bin/spark-1.5.1-bin-hadoop2.6$ ls
bin conf ec2 lib NOTICE R RELEASE
CHANGES.txt data examples LICENSE python README.md sbin
So, since README.md is not in your working directory, you should either provide the full path or make sure the file exists in your current working directory, i.e. the directory from which you started pyspark:
/usr/local/bin/spark-1.5.1-bin-hadoop2.6$ ./bin/pyspark
[...]
>>> import os
>>> os.getcwd()
'/usr/local/bin/spark-1.5.1-bin-hadoop2.6'
>>> os.listdir(os.getcwd())
['lib', 'LICENSE', 'python', 'NOTICE', 'examples', 'ec2', 'README.md', 'conf', 'CHANGES.txt', 'R', 'data', 'RELEASE', 'bin', 'sbin']
Now your code will work, since README.md is in your working directory:
>>> textFile=sc.textFile("README.md")
[...]
>>> textFile.count()
[...]
98
BTW, the correct count here is 98 (cross-checked); not sure why the tutorial says 126 (presumably the README.md bundled with a different Spark version simply has a different number of lines).
Summarizing: use os.listdir(os.getcwd()) to check that the file you are looking for exists in your current working directory; if it does, your code above works unmodified; if not, either provide the full file path or change your working directory with the appropriate Python commands, as in the sketch below.
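Here is a minimal sketch of both options, assuming SPARK_HOME is set in your environment and sc is the SparkContext that the pyspark shell creates for you:
import os

# Assumption: SPARK_HOME points at your Spark installation directory
spark_home = os.environ["SPARK_HOME"]

# Option 1: pass the full path to textFile()
textFile = sc.textFile(os.path.join(spark_home, "README.md"))

# Option 2: change the working directory so the relative path resolves
os.chdir(spark_home)
textFile = sc.textFile("README.md")

textFile.count()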
Upvotes: 3