Problems reading from S3 on an EMR cluster

I am developing an application in Java Spark. I generated the .jar and successfully uploaded it to the EMR cluster. One line of the code reads:

JsonReader jsonReader = new JsonReader(new FileReader("s3://naturgy-sabt-dev/QUERY/input.json"));

I am 100% sure of:

When submitting the Spark jar, I am getting the following error. (Note that the path about to be read is printed right before the Java statement above is called.)

...
...
...
19/12/11 15:55:46 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 172.31.36.11, 35744, None)
19/12/11 15:55:46 INFO BlockManager: external shuffle service port = 7337
19/12/11 15:55:46 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 172.31.36.11, 35744, None)
19/12/11 15:55:48 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/local-1576079746613
19/12/11 15:55:48 INFO SharedState: Warehouse path is 'hdfs:///user/spark/warehouse'.
#########################################
I am going to read from s3://naturgy-sabt-dev/QUERY/input.json
#########################################
java.io.FileNotFoundException: s3:/naturgy-sabt-dev/QUERY/input.json (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at java.io.FileReader.<init>(FileReader.java:58)
...
...
...

Does anyone know what's going on?

Thanks for any help you can provide.

Upvotes: 0

Views: 1233

Answers (1)

dre-hh

Reputation: 8044

Java's default FileReader cannot load files from AWS S3. It only reads from the local filesystem, which is why the stack trace shows the path collapsed to s3:/naturgy-sabt-dev/...: the string is being treated as a local file path, not as an S3 URI. S3 objects can only be read with third-party libraries. A bare S3 client ships with the AWS SDK for Java, and Hadoop also provides libraries to read from S3. The Hadoop jars are preinstalled on AWS EMR Spark clusters (in fact, on almost all Spark installations).
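For illustration, here is a minimal sketch of streaming the object through Hadoop's FileSystem API, which is already on the EMR classpath. The bucket and key are hypothetical stand-ins, and error handling is omitted; outside EMR you would typically use the s3a:// scheme and add the hadoop-aws jar:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ReadSketch {
    public static void main(String[] args) throws Exception {
        // hypothetical path standing in for the real bucket/key
        String path = "s3://my-bucket/QUERY/input.json";

        // resolve the filesystem that handles the s3:// scheme (EMRFS on EMR)
        FileSystem fs = FileSystem.get(URI.create(path), new Configuration());

        // fs.open streams the object from S3 instead of the local disk,
        // which is where a plain FileReader goes wrong
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(path))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}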

Spark also supports loading data from S3 directly into a DataFrame, without any manual steps. All readers can read either a single file or multiple files with the same structure via a glob pattern. The JSON DataFrame reader expects newline-delimited JSON by default, but this can be configured.

Various ways to use it:

# read a single newline-delimited JSON file; each line is a record
spark.read.json("s3://path/input.json")

# read a single serialized JSON object or array spanning multiple lines
spark.read.option("multiLine", True).json("s3://path/input.json")

# read multiple JSON files
spark.read.json("s3://folder/*.json")

Upvotes: 1
