Reputation: 5381
I have many gzipped files stored on S3, organized by project and by hour of day. The file paths follow this pattern:
s3://<bucket>/project1/20141201/logtype1/logtype1.0000.gz
s3://<bucket>/project1/20141201/logtype1/logtype1.0100.gz
....
s3://<bucket>/project1/20141201/logtype1/logtype1.2300.gz
Since the data should be analyzed on a daily basis, I have to download and decompress the files belonging to a specific day, then assemble the content as a single RDD.
There are several ways to do this, but I would like to know the best practice for Spark.
Thanks in advance.
Upvotes: 17
Views: 20942
Reputation: 12692
The underlying Hadoop API that Spark uses to access S3 allows you to specify input files using a glob expression.
From the Spark docs:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
So in your case you should be able to open all those files as a single RDD using something like this:
rdd = sc.textFile("s3://bucket/project1/20141201/logtype1/logtype1.*.gz")
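If the day you need changes from run to run, here is a minimal sketch of parameterizing that glob. The helper function and the bucket name are made up for illustration, and sc is assumed to be your existing SparkContext:

# Hypothetical helper: build the glob for one day and load all its hourly files as one RDD.
def rdd_for_day(sc, bucket, project, day, logtype):
    path = "s3://{}/{}/{}/{}/{}.*.gz".format(bucket, project, day, logtype, logtype)
    return sc.textFile(path)

daily_rdd = rdd_for_day(sc, "my-bucket", "project1", "20141201", "logtype1")
print(daily_rdd.count())  # quick sanity check that the hourly files were picked up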
Just for the record, you can also specify files using a comma-delimited list, and you can even mix that with the * and ? wildcards.
For example:
rdd = sc.textFile("s3://bucket/201412??/*/*.gz,s3://bucket/random-file.txt")
Briefly, what this does is:
- * matches all strings, so in this case all gz files in all folders under 201412?? will be loaded.
- ? matches a single character, so 201412?? will cover all days in December 2014 like 20141201, 20141202, and so forth.
- , lets you load separate files at once into the same RDD, like the random-file.txt in this case (see the sketch below).
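For completeness, a small sketch of the comma-delimited approach applied to the question's hourly layout; the bucket name is a placeholder and sc is assumed to be an existing SparkContext:

# Illustrative only: list the 24 hourly files for one day explicitly instead of using a glob.
hours = ["%02d00" % h for h in range(24)]  # "0000", "0100", ..., "2300"
paths = ["s3://my-bucket/project1/20141201/logtype1/logtype1.%s.gz" % h for h in hours]
rdd = sc.textFile(",".join(paths))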
Some notes about the appropriate URL scheme for S3 paths:
- If you're running Spark on Amazon EMR, use s3://.
- For open source Spark, s3a:// is the way to go.
- s3n:// has been deprecated on the open source side in favor of s3a://. You should only use s3n:// if you're running Spark on Hadoop 2.6 or older.
Upvotes: 29
Reputation: 1397
Using AWS EMR with Spark 2.0.0 and SparkR in RStudio, I've managed to read the gz-compressed Wikipedia stats files stored in S3 using the command below:
df <- read.text("s3://<bucket>/pagecounts-20110101-000000.gz")
Similarly, to read all files for January 2011 you can use the same command with wildcards:
df <- read.text("s3://<bucket>/pagecounts-201101??-*.gz")
See the SparkR API docs for more ways of doing it: https://spark.apache.org/docs/latest/api/R/read.text.html
Upvotes: 1
Reputation: 19995
Note: Under Spark 1.2, the proper format would be as follows:
val rdd = sc.textFile("s3n://<bucket>/<foo>/bar.*.gz")
That's s3n://, not s3://.
You'll also want to put your credentials in conf/spark-env.sh as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
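If you'd rather not edit spark-env.sh, here is a hedged PySpark sketch of passing the same credentials through the Hadoop configuration instead; the key names below are the standard fs.s3n.* ones (use the fs.s3a.* equivalents with s3a://), and the bucket/path is a placeholder:

# Sketch: set AWS credentials on the Hadoop configuration instead of spark-env.sh.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
rdd = sc.textFile("s3n://my-bucket/project1/20141201/logtype1/logtype1.*.gz")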
Upvotes: 7