DarqMoth

Reputation: 603

Scalding Tutorial with HDFS: Data is missing from one or more paths in: List(tutorial/data/hello.txt)

After configuring ssh and rsync, I tried to run the Scalding tutorial (https://github.com/Cascading/scalding-tutorial/) with this command:

$ scripts/scald.rb --hdfs tutorial/Tutorial0.scala

I get the following error:

com.twitter.scalding.InvalidSourceException: [com.twitter.scalding.TextLineWrappedArray(tutorial/data/hello.txt)] Data is missing from one or more paths in: List(tutorial/data/hello.txt)

The error occurs even though the file tutorial/data/hello.txt exists.

How can I fix this?

Stdout:

$ scripts/scald.rb --hdfs tutorial/Tutorial0.scala
scripts/scald.rb:194: warning: already initialized constant SCALA_LIB_DIR
[email protected]'s password: 
[email protected]'s password: 

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/phoenix/phoenix-4.0.0.2.1.2.1-471-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/07/07 19:05:45 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/07/07 19:05:45 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
Exception in thread "main" java.lang.Throwable: GUESS: Data is missing from the path you provided.
If you know what exactly caused this error, please consider contributing to GitHub via following link.
https://github.com/twitter/scalding/wiki/Common-Exceptions-and-possible-reasons#comtwitterscaldinginvalidsourceexception
    at com.twitter.scalding.Tool$.main(Tool.scala:132)
    at com.twitter.scalding.Tool.main(Tool.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Upvotes: 0

Views: 596

Answers (2)

Gianmario Spacagna

Reputation: 1300

Try packaging your job into a fat jar with the Maven Shade plugin, then run the Scalding job with the hadoop command:

hadoop jar your-uber.jar com.twitter.scalding.Tool bar.foo.MyClassJob --hdfs --input ... --output ...

Upvotes: 0

Balduz

Reputation: 3570

I think the problem is that you are telling Scalding to run in HDFS mode, but the input file is on your local file system, not in HDFS. Before running the example, upload the file to HDFS:

hadoop fs -mkdir tutorial
hadoop fs -mkdir tutorial/data
hadoop fs -put tutorial/data/hello.txt tutorial/data/hello.txt
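If it helps, the upload can be wrapped in a small check so it only runs when the file is not in HDFS yet. This is just a sketch: `ensure_in_hdfs` is a hypothetical helper name, and it assumes a Hadoop 2.x shell where `hadoop fs -test -e` and `hadoop fs -mkdir -p` are available. Relative HDFS paths like `tutorial/data/hello.txt` resolve against your HDFS home directory.

```shell
# Hypothetical helper (not part of the tutorial): copy a local file to the
# same relative path in HDFS unless something already exists at that path.
ensure_in_hdfs() {
  path="$1"
  # "hadoop fs -test -e" exits with 0 when the path exists in HDFS.
  if hadoop fs -test -e "$path"; then
    echo "already in HDFS: $path"
  else
    # Create the parent directories in one step, then upload the file.
    hadoop fs -mkdir -p "$(dirname "$path")" &&
      hadoop fs -put "$path" "$path" &&
      echo "uploaded: $path"
  fi
}
```

Running `ensure_in_hdfs tutorial/data/hello.txt` from the tutorial checkout before `scripts/scald.rb --hdfs tutorial/Tutorial0.scala` should then put the input where the job looks for it.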

Upvotes: 3
