optimist

Reputation: 1018

Accessing csv file placed in hdfs using spark

I have placed a CSV file into the HDFS filesystem using the `hadoop fs -put` command. I now need to read that CSV file with PySpark, something like:

`plaintext_rdd = sc.textFile('hdfs://x.x.x.x/blah.csv')`

I am a newbie to HDFS. How do I find the address that should go in place of `x.x.x.x`?

Here's the output when I entered the command:

hduser@remus:~$ hdfs dfs -ls /input

Found 1 items
-rw-r--r--   1 hduser supergroup        158 2015-06-12 14:13 /input/test.csv

Any help is appreciated.

Upvotes: 1

Views: 2188

Answers (3)

Sairam Asapu

Reputation: 13

Start the spark-shell or spark-submit by pointing to the package that can read CSV files, like below:

spark-shell  --packages com.databricks:spark-csv_2.11:1.2.0

Then, in your Spark code, you can read the CSV file as below:

val data_df = sqlContext.read.format("com.databricks.spark.csv")
              .option("header", "true")
              .schema(<pass schema if required>)
              .load(<location in HDFS/S3>)
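
Since the question is about PySpark, the equivalent call with the same spark-csv package would look roughly like this. This is only a sketch; the HDFS path and the header option are assumptions based on the question:

# Launch with: pyspark --packages com.databricks:spark-csv_2.11:1.2.0
# Assumed path: the /input/test.csv file shown in the question.
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("hdfs://x.x.x.x/input/test.csv")
df.show()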

Upvotes: 0

Abhishek Choudhary

Reputation: 8395

You need to provide the full path of your files in HDFS, and the URL is the one set in your Hadoop configuration (core-site.xml or hdfs-site.xml).

Check your core-site.xml and hdfs-site.xml to get the details of the URL.

An easy way to find the URL is to open HDFS in your browser (the NameNode web UI) and get the path from there.

If you are using an absolute path on your local file system, use file:///<your path>
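
For example, a minimal sketch of how the configured URL ends up in the PySpark call. The host and port here are assumptions; check your own core-site.xml, or run `hdfs getconf -confKey fs.defaultFS` on the cluster to print the value:

# Assuming fs.defaultFS in core-site.xml is hdfs://localhost:9000
plaintext_rdd = sc.textFile('hdfs://localhost:9000/input/test.csv')

# A file on the local file system, by contrast, uses the file:// scheme:
# local_rdd = sc.textFile('file:///home/hduser/test.csv')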

Upvotes: 1

vvladymyrov

Reputation: 5793

Try specifying the absolute path without `hdfs://`:

plaintext_rdd = sc.textFile('/input/test.csv')

When Spark runs on the same cluster as HDFS, it uses hdfs:// as the default filesystem.
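
To go from that RDD of raw lines to actual records, a minimal sketch (assuming comma-separated fields with no embedded commas or quoting):

plaintext_rdd = sc.textFile('/input/test.csv')
# Split each line on commas to get a list of fields per record.
records = plaintext_rdd.map(lambda line: line.split(','))
print(records.take(5))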

Upvotes: 0
