Reputation: 1875
I can read a few JSON files at the same time using `*` (star):
sqlContext.jsonFile('/path/to/dir/*.json')
Is there any way to do the same thing for Parquet? The star doesn't work.
Upvotes: 19
Views: 77689
Reputation: 61
For reading: give the file path with a '*' wildcard.
Example:
pqtDF = sqlContext.read.parquet("Path_*.parquet")
Upvotes: 6
Reputation: 261
InputPath = [hdfs_path + "parquets/date=18-07-23/hour=2*/*.parquet",
hdfs_path + "parquets/date=18-07-24/hour=0*/*.parquet"]
df = spark.read.parquet(*InputPath)
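Note that the `*` in `spark.read.parquet(*InputPath)` is Python's argument-unpacking operator, not a file glob: it expands the list into separate positional arguments for the varargs-style reader. A minimal stdlib-only sketch of that mechanism, with a hypothetical `read_parquet` stand-in for the Spark call:

```python
def read_parquet(*paths):
    # Stand-in for spark.read.parquet, which likewise accepts varargs.
    return list(paths)

input_paths = [
    "parquets/date=18-07-23/hour=2*/*.parquet",
    "parquets/date=18-07-24/hour=0*/*.parquet",
]

# The leading * unpacks the list into two separate arguments,
# equivalent to read_parquet(input_paths[0], input_paths[1]).
result = read_parquet(*input_paths)
print(result)
```

The glob patterns inside each string are expanded later by Hadoop's path resolution, not by Python.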
Upvotes: 26
Reputation: 1143
FYI, you can also:
- read a subset of parquet files using the wildcard symbol `*`:
  sqlContext.read.parquet("/path/to/dir/part_*.gz")
- read multiple parquet files by explicitly specifying them:
  sqlContext.read.parquet("/path/to/dir/part_1.gz", "/path/to/dir/part_2.gz")
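The wildcard follows ordinary glob semantics, so `part_*.gz` matches any file name that starts with `part_` and ends in `.gz`. A small stdlib sketch of that matching rule (the candidate file names are invented for illustration):

```python
from fnmatch import fnmatch

candidates = ["part_1.gz", "part_2.gz", "data_1.gz", "part_10.parquet"]

# Keep only the names the wildcard pattern would select.
matched = [name for name in candidates if fnmatch(name, "part_*.gz")]
print(matched)  # ['part_1.gz', 'part_2.gz']
```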
Upvotes: 31
Reputation: 2747
See this issue on the Spark JIRA. It is supported from Spark 1.4 onwards.
Without upgrading to 1.4, you could either point at the top-level directory:
sqlContext.parquetFile('/path/to/dir/')
which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want and pass them to parquetFile (it accepts varargs).
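One way to "find the files you want" is to build the file list yourself and unpack it into parquetFile's varargs. A hedged sketch using the stdlib glob module against a local filesystem (the throwaway directory and file names are made up for illustration; on a real cluster you would list paths through the Hadoop FileSystem API instead):

```python
import glob
import os
import tempfile

# Create a throwaway directory with a few empty files so the
# glob has something to match (illustration only).
tmp = tempfile.mkdtemp()
for name in ("part_1.parquet", "part_2.parquet", "notes.txt"):
    open(os.path.join(tmp, name), "w").close()

# Select only the parquet parts; sort for a deterministic order.
files = sorted(glob.glob(os.path.join(tmp, "part_*.parquet")))
print(files)

# With Spark you would then unpack the list into varargs:
# df = sqlContext.parquetFile(*files)
```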
Upvotes: 11