Reputation: 299
I am processing CSV files from S3 using PySpark, and I want to add the filename as a new column. I am using the code below:
spark.udf.register("filenamefunc", lambda x: x.rsplit('/', 1)[-2])
df=spark.read.csv("s3a://exportcsv-battery/S5/243/101*",sep=',',header=True,inferSchema=True)
df=df.withColumn("filename", 'filenamefunc(input_file_name())')
However, instead of the full filename, I want only a substring of it. For example, if this is the input_file_name:
s3a://exportcsv-battery/S5/243/101_002932_243_AAA_A_T01_AAA_AAA_0_0_0_0_2_10Hz.csv
I only want 243 to be extracted and stored in a new column, so I defined a UDF as:
spark.udf.register("filenamefunc", lambda x: x.rsplit('/', 1)[-2])
But it doesn't seem to work. Is there something I can do to fix it, or is there a different approach? Thanks!
Upvotes: 0
Views: 1170
Reputation: 4059
You can use the split() function:
import pyspark.sql.functions as f
[...]
df = df.withColumn('filename', f.split(f.input_file_name(), '/')[4])
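To see why index 4 picks out 243 while the lambda from the question does not, the same string logic can be checked in plain Python without Spark. This is only a sketch on the sample path from the question; the variable names (udf_result, parts, by_position, parent_dir) are made up for illustration, and counting from the end is an alternative I'm assuming would suit files at varying depths, not something from the original code.

```python
# Plain-Python check of the path logic on the sample path; no Spark needed.
path = "s3a://exportcsv-battery/S5/243/101_002932_243_AAA_A_T01_AAA_AAA_0_0_0_0_2_10Hz.csv"

# The lambda from the question: rsplit with maxsplit=1 yields
# [directory prefix, file name], so [-2] is the whole prefix, not "243".
udf_result = path.rsplit("/", 1)[-2]

# Splitting on every "/" mirrors the split(...)[4] approach.
# The empty string at index 1 comes from the "//" after "s3a:".
parts = path.split("/")
by_position = parts[4]

# Hypothetical alternative: index from the end, so the result does not
# depend on how many path segments precede the parent directory.
parent_dir = parts[-2]

print(udf_result)
print(parts)
print(by_position, parent_dir)
```

Separately from the indexing, withColumn expects a Column as its second argument, so a registered SQL function passed as a plain string would need to be wrapped in f.expr(...) before it could be used there.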
Upvotes: 2