Reputation: 85
We've been running into issues with bulk file ingestion into Spark.
Currently, I'm aware that multiple files can be ingested with wildcards:
spark.read.csv("path/to/file*.csv")
or by passing a list of paths of interest:
spark.read.csv(["path/to/file1.csv", "path/to/file2.csv"])
In our situation, we have a large number of files (>100k) whose IDs are encoded in the filenames, with no ID column in the data itself. Either method above acts like a simple union of the files and doesn't seem to store the filename anywhere in the resulting dataset.
How would I go about combining all these CSVs while preserving the filename-encoded ID?
Upvotes: 3
Views: 3780
Reputation: 15258
There is a built-in function for exactly this: input_file_name.
from pyspark.sql import functions as F
# attach the full URI of each row's source file as a new column
df = spark.read.csv("path/to/file*.csv").withColumn("filename", F.input_file_name())
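Note that input_file_name() returns the full file URI (e.g. file:///path/to/file123.csv), so you'll likely want to parse the ID out of it afterwards. As a minimal sketch, assuming the ID is the numeric part of a name like file123.csv, you could pull it out with regexp_extract (the pattern below is a placeholder; adapt it to your actual naming scheme):
from pyspark.sql import functions as F
df = spark.read.csv("path/to/file*.csv").withColumn("filename", F.input_file_name())
# extract the digits from a name like .../file123.csv into their own column;
# group 1 of the regex captures the ID
df = df.withColumn("file_id", F.regexp_extract("filename", r"file(\d+)\.csv$", 1))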
Upvotes: 5