Reputation: 85
We've been running into issues with bulk file ingestion into Spark.
Currently, I'm aware that multiple files can be ingested with wildcards:
spark.read.csv("path/to/file*.csv")
or by passing a list of paths of interest:
spark.read.csv(["path/to/file1.csv", "path/to/file2.csv"])
In our situation, we have a large number of files (>100k) whose IDs are encoded in the filenames, with no ID column in the data itself. Either method above acts like a simple union of the files and doesn't seem to store the filename anywhere in the resulting dataset.
How would I go about combining all these CSVs while preserving the filename-encoded ID?
Upvotes: 3
Views: 3780
Reputation: 15258
There is a built-in function for exactly this: input_file_name.
from pyspark.sql import functions as F
# attach the full URI of each row's source file as a new column
df = spark.read.csv("path/to/file*.csv").withColumn("filename", F.input_file_name())
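Note that input_file_name() returns the full file URI (e.g. file:///path/to/file123.csv), so you'll likely want to parse the ID out of it afterwards. As a minimal sketch, assuming the ID is the numeric part of a name like file123.csv, you could pull it out with regexp_extract (the pattern below is a placeholder; adapt it to your actual naming scheme):
from pyspark.sql import functions as F
df = spark.read.csv("path/to/file*.csv").withColumn("filename", F.input_file_name())
# extract the digits from a name like .../file123.csv into their own column;
# group 1 of the regex captures the ID
df = df.withColumn("file_id", F.regexp_extract("filename", r"file(\d+)\.csv$", 1))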
Upvotes: 5