hitttt

Reputation: 1189

How to access multiple JSON files using a DataFrame from S3

I am using Apache Spark. I want to read multiple JSON files from Spark on a date basis. How can I pick multiple files, i.e. provide a range from files ending with 1034.json up to files ending with 1434.json? I am trying this:

DataFrame df = sql.read().json("s3://..../..../.....-.....[1034*-1434*]");

But I am getting the following error:

    at java.util.regex.Pattern.error(Pattern.java:1924)
    at java.util.regex.Pattern.range(Pattern.java:2594)
    at java.util.regex.Pattern.clazz(Pattern.java:2507)
    at java.util.regex.Pattern.sequence(Pattern.java:2030)
    at java.util.regex.Pattern.expr(Pattern.java:1964)
    at java.util.regex.Pattern.compile(Pattern.java:1665)
    at java.util.regex.Pattern.<init>(Pattern.java:1337)
    at java.util.regex.Pattern.compile(Pattern.java:1022)
    at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
    at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
    at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)

How can I resolve this?

Upvotes: 4

Views: 2474

Answers (1)

Shankar

Reputation: 8967

You can read all the JSON files under a path like this:

sqlContext.read().json("s3n://bucket/filepath/*.json")

You can also use glob patterns in the file path. Note that Hadoop globs do not support multi-digit numeric ranges like [1034-1434] (that is what triggers the regex error above); a character class such as [0-4] matches a single character only. So a suffix range stepping by hundreds can be written as:

For example:

sqlContext.read().json("s3n://bucket/filepath/*1[0-4]34.json")
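If the suffixes you need cannot be expressed as a single glob, another option is to build the list of path globs yourself and pass them all to the reader (Spark 2.0+ DataFrameReader.json accepts multiple paths). A minimal sketch, assuming the suffixes step by 100 and using a placeholder bucket/prefix; adjust the step to match your actual file names:

```java
import java.util.ArrayList;
import java.util.List;

public class JsonPathBuilder {

    // Builds one S3 glob per file suffix in [start, end], stepping by `step`,
    // e.g. "s3n://bucket/filepath/*1034.json", "...*1134.json", ...
    public static String[] buildPaths(String prefix, int start, int end, int step) {
        List<String> paths = new ArrayList<>();
        for (int suffix = start; suffix <= end; suffix += step) {
            paths.add(prefix + "*" + suffix + ".json");
        }
        return paths.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] paths = buildPaths("s3n://bucket/filepath/", 1034, 1434, 100);
        for (String p : paths) {
            System.out.println(p);
        }
        // In Spark 2.0+ you can then pass all paths at once:
        // Dataset<Row> df = sparkSession.read().json(paths);
    }
}
```

Each glob is resolved independently, so this avoids having to encode the whole numeric range in a single pattern.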

Upvotes: 3
