Kenneth Lim

Reputation: 85

Pyspark: Read multiple csv files and annotate them with the source

We've been running into issues with bulk file ingestion into Spark.

Currently, I'm aware that multiple file ingestion can be done with wildcards

spark.read.csv("path/to/file*.csv")

or via passing a list of paths of interest

spark.read.csv(["path/to/file1.csv", "path/to/file2.csv"])

In our situation, we have a large number of files (>100k) with IDs encoded in the filenames and no ID column within the data itself. Either method above acts like a simple union of the files, and doesn't seem to offer any way to store the source filename anywhere in the resulting dataset.

How would I go about combining all these CSVs while preserving the filename-encoded ID?

Upvotes: 3

Views: 3780

Answers (1)

Steven

Reputation: 15258

There is a simple function for this, input_file_name, in pyspark.sql.functions:

from pyspark.sql import functions as F

df = spark.read.csv("path/to/file*.csv").withColumn("filename", F.input_file_name())

Upvotes: 5
