Ravi

Reputation: 3303

Issue reading CSV file into Spark

I am trying to load a CSV file into HDFS and then read it into Spark as an RDD. I am using the Hortonworks Sandbox and trying these steps through the command line. I loaded the data as follows:

hadoop fs -put data.csv /

The data seems to have loaded properly, as shown by the following command:

[root@sandbox temp]# hadoop fs -ls /data.csv
-rw-r--r--   1 hdfs hdfs   70085496 2015-10-04 14:17 /data.csv

In pyspark, I tried reading this file as follows:

data = sc.textFile('/data.csv')

However, the following take command throws an error:

data.take(5)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/rdd.py", line 1194, in take
    totalParts = self._jrdd.partitions().size()
  File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1- src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data.csv

Can someone help me with this error?

Upvotes: 2

Views: 3454

Answers (3)

mnis.p

Reputation: 451

If you want to create an RDD from a text or CSV file on the local file system, use

rdd = sc.textFile("file:///path/to/file")

Note the three slashes: file:// followed by an absolute path, so the authority part of the URI is empty.
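
For example, with a copy of the CSV at /tmp/data.csv (a hypothetical local path), this should work:

# Read a local copy of the file; the empty authority in file:///
# means /tmp/data.csv is resolved as an absolute local path
rdd = sc.textFile("file:///tmp/data.csv")
rdd.take(5)  # returns the first five lines

Keep in mind that with a local path, the file must be readable at the same location on every node that runs the job; on a single-node sandbox this is trivially true.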

Upvotes: 0

Ravi

Reputation: 3303

I figured out the answer: I had to enter the fully qualified path of the HDFS file, as follows:

data = sc.textFile('hdfs://sandbox.hortonworks.com:8020/data.csv')

The full path prefix (hdfs://host:port) is obtained from the fs.defaultFS property in conf/core-site.xml.
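
The same value can also be checked from inside pyspark through the JVM-side Hadoop configuration (a diagnostic sketch; sc._jsc is an internal handle, not a stable public API):

# Print the default filesystem URI that core-site.xml provides,
# e.g. hdfs://sandbox.hortonworks.com:8020
print(sc._jsc.hadoopConfiguration().get("fs.defaultFS"))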

Upvotes: 3

WoodChopper

Reputation: 4375

Error: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data.csv

It is reading from your local file system instead of HDFS.

Try providing the file path like below:

data = sc.textFile("hdfs:///data.csv")
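
With only two slashes, hdfs://data.csv would parse data.csv as the authority (the namenode host) rather than the path, so the scheme needs either a third slash or an explicit host:port. Either of these forms should resolve against HDFS on the sandbox (host and port taken from the accepted answer above):

data = sc.textFile("hdfs:///data.csv")                              # empty authority, falls back to fs.defaultFS
data = sc.textFile("hdfs://sandbox.hortonworks.com:8020/data.csv")  # fully qualified URI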

Upvotes: 0
