hitttt

Reputation: 1189

How to access multiple JSON files using a DataFrame from S3

I am using Apache Spark. I want to read multiple JSON files from Spark on a date basis. How can I pick multiple files, i.e. provide a range from files ending with 1034.json up to files ending with 1434.json? I am trying this:

DataFrame df = sql.read().json("s3://..../..../.....-.....[1034*-1434*]");

But I am getting the following error:

    at java.util.regex.Pattern.error(Pattern.java:1924)
    at java.util.regex.Pattern.range(Pattern.java:2594)
    at java.util.regex.Pattern.clazz(Pattern.java:2507)
    at java.util.regex.Pattern.sequence(Pattern.java:2030)
    at java.util.regex.Pattern.expr(Pattern.java:1964)
    at java.util.regex.Pattern.compile(Pattern.java:1665)
    at java.util.regex.Pattern.<init>(Pattern.java:1337)
    at java.util.regex.Pattern.compile(Pattern.java:1022)
    at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
    at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
    at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)

How can I resolve this?

Upvotes: 4

Views: 2474

Answers (1)

Shankar

Reputation: 8967

You can read all the JSON files under a path like this:

sqlContext.read().json("s3n://bucket/filepath/*.json")

You can also use glob patterns in the file path. Note that Hadoop globs do not support multi-digit numeric ranges like [1034-1434] (that is what triggers the regex error above); a character class such as [0-4] matches a single character only. So a suffix range stepping by hundreds can be written as:

For example:

sqlContext.read().json("s3n://bucket/filepath/*1[0-4]34.json")
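If the suffixes you need cannot be expressed as a single glob, another option is to build the list of path globs yourself and pass them all to the reader (Spark 2.0+ DataFrameReader.json accepts multiple paths). A minimal sketch, assuming the suffixes step by 100 and using a placeholder bucket/prefix; adjust the step to match your actual file names:

```java
import java.util.ArrayList;
import java.util.List;

public class JsonPathBuilder {

    // Builds one S3 glob per file suffix in [start, end], stepping by `step`,
    // e.g. "s3n://bucket/filepath/*1034.json", "...*1134.json", ...
    public static String[] buildPaths(String prefix, int start, int end, int step) {
        List<String> paths = new ArrayList<>();
        for (int suffix = start; suffix <= end; suffix += step) {
            paths.add(prefix + "*" + suffix + ".json");
        }
        return paths.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] paths = buildPaths("s3n://bucket/filepath/", 1034, 1434, 100);
        for (String p : paths) {
            System.out.println(p);
        }
        // In Spark 2.0+ you can then pass all paths at once:
        // Dataset<Row> df = sparkSession.read().json(paths);
    }
}
```

Each glob is resolved independently, so this avoids having to encode the whole numeric range in a single pattern.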

Upvotes: 3
