Reputation: 13
I need to read multiple parquet files in Apache Beam; all of the files are in the same folder. I've tried to read them using the wildcard character *.
I've managed to read individual parquet files using ParquetIO, and this is the snippet I use to read one parquet file:
pipeline.apply(ParquetIO.read(SCHEMA).from(filePath + File.separator + "*"));
where filePath is, for example, /path/xxx.parquet.
The snippet I've used to try to read multiple parquet files is:
pipeline.apply(ParquetIO.read(SCHEMA).from(folderPath + File.separator + "*.parquet" + File.separator + "*"));
where folderPath is, for example, /path/to/parquet/files/
I also tried without the last part, File.separator + "*", but the result is the same. The log message I get is:
FileIO:654 - Matched 0 files for pattern /path/to/parquet/files/*.parquet/ *
Also, the number and names of the parquet files can vary.
Is it possible to read multiple parquet files using Apache Beam? I have already found a way to read multiple txt files.
Upvotes: 0
Views: 1172
Reputation: 1443
Yes, it's possible to read multiple parquet files with ParquetIO, since it uses FileIO under the hood. You just need a different match pattern. In your case it could be something like this (I assume that folderPath is "/path/to"):
pipeline.apply(ParquetIO.read(SCHEMA).from(folderPath + File.separator + "parquet" + File.separator + "*" + File.separator + "*"));
or just a double star at the end:
pipeline.apply(ParquetIO.read(SCHEMA).from(folderPath + File.separator + "parquet" + File.separator + "**"));
You can't use . as a wildcard in a glob pattern, since it can be a legitimate part of a file path. Use ? to match any single character or * to match any string within a single directory. The ** pattern also matches any string, but it crosses directory boundaries.
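To get a feel for how these wildcards behave, here is a minimal sketch that uses the JDK's own glob matcher (java.nio.file.PathMatcher), not Beam's. The JDK matcher documents the same rules for ?, * and **, so it can serve as a quick local check, but it is only an analogy and Beam's matching may differ in details. The class name GlobDemo, the helper matches and the sample file names are made up for illustration, and the paths assume a Unix-like file system.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    // Hypothetical helper: checks a path against a glob with the JDK matcher.
    // This is java.nio's glob implementation, used here only as a stand-in
    // for illustrating the wildcard rules; it is not what Beam runs.
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        String dir = "/path/to/parquet/files"; // folder from the question

        // * matches any string within a single directory.
        System.out.println(matches(dir + "/*.parquet", dir + "/a.parquet"));        // true

        // The pattern from the question expects one more level below *.parquet,
        // so a plain file directly in the folder does not match it.
        System.out.println(matches(dir + "/*.parquet/*", dir + "/a.parquet"));      // false

        // * does not cross directory boundaries, ** does.
        System.out.println(matches(dir + "/*", dir + "/sub/a.parquet"));            // false
        System.out.println(matches(dir + "/**", dir + "/sub/a.parquet"));           // true

        // ? matches exactly one character.
        System.out.println(matches(dir + "/part-?.parquet", dir + "/part-1.parquet"));  // true
        System.out.println(matches(dir + "/part-?.parquet", dir + "/part-10.parquet")); // false

        // . is a literal dot, not a wildcard.
        System.out.println(matches(dir + "/a.parquet", dir + "/aXparquet"));        // false
    }
}
```

If the files sit directly in the folder, a pattern such as folderPath + "/*.parquet" would be the string to pass to ParquetIO.read(SCHEMA).from(...); if each .parquet entry is itself a directory of part files, the extra "/*" level (or "**") is what is needed.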
Upvotes: 1