Taras

Reputation: 469

Spark: how to read all files with different extensions in a directory recursively?

I have a directory structure in HDFS like this:

folder
├── sub1
│   ├── a
│   │   └── f1.txt
│   └── b
│       └── f2.parquet
└── sub2
    ├── a
    │   └── f3.jpg
    └── b
        └── f4.unknown

Is there a way to skip some files (those with an unknown extension) while reading with Spark? Can I read all files present in the directory?

Upvotes: 1

Views: 5088

Answers (1)

Mohana B C

Reputation: 5487

Spark provides different read APIs to handle different file formats.

Example:

If you want to read text or CSV files, you can use the spark.read.text or spark.read.csv method. For JSON you can use spark.read.json, for Parquet spark.read.parquet, and so on. You need to use the method that matches the file format to get a proper DataFrame.
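For instance, a minimal PySpark sketch (assuming an existing SparkSession named spark and the file paths from your question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each format has its own reader method on spark.read
txt_df = spark.read.text("folder/sub1/a/f1.txt")             # one string column named "value"
parquet_df = spark.read.parquet("folder/sub1/b/f2.parquet")  # schema taken from the Parquet metadata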

Spark version < 3.0.0

Let's say you have files of different formats under the folder structure you specified in your question. You would need something like the code below to read only the CSV files.

spark.read.csv(["folder/sub1/a/*.csv", "folder/sub2/a/*.csv", "folder/sub1/b/*.csv", "folder/sub2/b/*.csv"])
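Since all of these paths sit at the same depth, a path glob over the sub-folders is an equivalent shorthand (a sketch assuming the two-level layout from your question):

# Matches any CSV file two directory levels below "folder"
spark.read.csv("folder/*/*/*.csv")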

Spark version >= 3.0.0

From this version onward, instead of specifying each sub-folder path, you can use options such as pathGlobFilter and recursiveFileLookup and pass only the parent folder path to the read method. See the Spark documentation on generic file source options for details.
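For example, a minimal sketch that reads only the Parquet files anywhere under the parent folder (assuming the layout from your question):

df = (spark.read
      .option("recursiveFileLookup", "true")   # descend into all sub-directories
      .option("pathGlobFilter", "*.parquet")   # keep only files matching this glob
      .parquet("folder"))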

Upvotes: 1
