NmdMystery

Reputation: 2868

Spark/Hadoop can't find file on AWS EMR

I'm trying to read in a text file on Amazon EMR using the python spark libraries. The file is in the home directory (/home/hadoop/wet0), but spark can't seem to find it.

Line in question:

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

Error:

pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-19-121.us-west-2.compute.internal:8020/user/hadoop/wet0;'

Does the file have to be in a specific directory? I can't find information about this anywhere on the AWS website.
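For context: the error message suggests Spark resolved the bare argument against the cluster's default filesystem (HDFS) and the user's HDFS home directory, not the local `/home/hadoop`. A rough sketch of that resolution rule in plain Python (the host name is a placeholder and this is not Hadoop's actual code):

```python
import posixpath

def resolve_hadoop_path(path, default_fs="hdfs://namenode:8020", user="hadoop"):
    """Roughly mimic how Hadoop turns a pathname into a full URI:
    paths that already carry a scheme are taken as-is, absolute paths
    go to the default filesystem, and relative paths are resolved
    against the user's HDFS home directory (/user/<name>)."""
    if "://" in path:
        return path                       # already a full URI
    if path.startswith("/"):
        return default_fs + path          # absolute path on the default FS
    return default_fs + posixpath.join("/user", user, path)  # relative path

print(resolve_hadoop_path("wet0"))
# -> hdfs://namenode:8020/user/hadoop/wet0  (same shape as the error above)
```

This matches the error: a bare `wet0` ends up under `hdfs://.../user/hadoop/`, which is why Spark never looks at the local home directory.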

Upvotes: 3

Views: 3057

Answers (2)

Lucas Penna

Reputation: 61

I don't know if it's just me, but when I tried the suggestion above I still got a "path does not exist" error on my EMR cluster. I just added one more "/" before "user" and it worked:

file:///user/hadoop/wet0

Thanks for the help!
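For what it's worth, the extra slash matters because a `file:` URI has the form `file://<authority>/<path>`; with only two slashes, the first path segment gets parsed as a host name. A quick way to see the correct three-slash form, using only the standard library (no Spark needed):

```python
from pathlib import Path

# Path.as_uri() always emits the three-slash form: an empty authority
# ("//") followed by the absolute path ("/home/hadoop/wet0").
uri = Path("/home/hadoop/wet0").as_uri()
print(uri)  # -> file:///home/hadoop/wet0
```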

Upvotes: 1

stevel

Reputation: 13430

If it's in the local filesystem, the URL should be file://user/hadoop/wet0. If it's in HDFS, that should be a valid path. Use the hadoop fs command to take a look,

e.g: hadoop fs -ls /home/hadoop

One thing to look at: you say it's in "/home/hadoop", but the path in the error is "/user/hadoop". Make sure you aren't using ~ on the command line, as bash will do the expansion before Spark sees it. Best to use the full path, /home/hadoop.
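To illustrate the tilde point: the shell, not Spark, performs the expansion, so the application only ever sees the already-expanded path. Python's `expanduser` follows the same rule (a minimal demo; HOME is set here purely for illustration):

```python
import os

# bash expands ~ before spark-submit even runs, so the application
# receives an absolute path. expanduser() applies the same rule.
os.environ["HOME"] = "/home/hadoop"       # placeholder home, for the demo
print(os.path.expanduser("~/wet0"))       # -> /home/hadoop/wet0
print(os.path.expanduser("wet0"))         # no tilde: left untouched -> wet0
```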

Upvotes: 3
