Danilo Dimitrijević

Reputation: 13

Read multiple parquet files using Apache Beam and ParquetIO

I need to read multiple parquet files in Apache Beam. All the files are in the same folder. I've tried to read them using the wildcard character *.

I've managed to read a single parquet file using ParquetIO, and this is the snippet I use to read one parquet file:

pipeline.apply(ParquetIO.read(SCHEMA).from(filePath + File.separator + "*"));

where filePath is, for example, /path/xxx.parquet.

This is the snippet I used when trying to read multiple parquet files:

pipeline.apply(ParquetIO.read(SCHEMA).from(folderPath + File.separator + "*.parquet" + File.separator + "*"));

where folderPath is, for example, /path/to/parquet/files/

I also tried it without the last part, File.separator + "*", but the result is the same. The log output I got is:

FileIO:654 - Matched 0 files for pattern /path/to/parquet/files/*.parquet/ *

Also, the number and names of the parquet files can vary.

Is it possible to read multiple parquet files using Apache Beam? I have already found a way to read multiple txt files.

Upvotes: 0

Views: 1172

Answers (1)

Alexey Romanenko

Reputation: 1443

Yes, it's possible to read multiple parquet files with ParquetIO, since it uses FileIO under the hood. You just need a different match pattern. In your case it could be something like this (I'm assuming that folderPath is "/path/to"):

pipeline.apply(ParquetIO.read(SCHEMA).from(folderPath + File.separator + "parquet" + File.separator + "*" + File.separator + "*"));

or just a double star at the end:

pipeline.apply(ParquetIO.read(SCHEMA).from(folderPath + File.separator + "parquet" + File.separator + "**"));

A . is not a wildcard in a glob pattern, since it can be a legitimate part of a file path, so it only matches a literal dot. Use ? to match any single character, or * to match any string within a single directory. The ** pattern also matches any string, but it crosses directory boundaries.
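To make the difference between ?, * and ** concrete, here is a minimal sketch that uses the JDK's own glob matcher (java.nio.file.PathMatcher) rather than Beam itself. It is only an illustration of the rules described above; I'm assuming Beam's matching behaves the same way for these simple cases, and the GlobDemo class, the matches helper and the paths such as /path/to/parquet/a.parquet/part-0000 are made up for the example.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {

    // Hypothetical helper: true when `path` matches `glob` under java.nio glob rules.
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // Hypothetical layout: parquet "files" may really be directories with part files inside.
        String dir = "/path/to/parquet";
        String flat = dir + "/a.parquet";
        String nested = dir + "/a.parquet/part-0000";

        // * matches any string, but only within a single directory.
        System.out.println(matches(dir + "/*", flat));     // true
        System.out.println(matches(dir + "/*", nested));   // false

        // One * per directory level reaches the nested part file.
        System.out.println(matches(dir + "/*/*", nested)); // true

        // ** crosses directory boundaries.
        System.out.println(matches(dir + "/**", nested));  // true

        // ? matches exactly one character.
        System.out.println(matches(dir + "/?.parquet", flat)); // true

        // . is a literal dot, not a wildcard.
        System.out.println(matches(dir + "/a.parquet", dir + "/axparquet")); // false
    }
}
```

If your parquet outputs are directories containing part files, this is why a single trailing * may match nothing useful, while */* or ** reaches the files inside.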

Upvotes: 1
