Pyspark Obtain Substring from Filename and Store as New Column

Question

I am processing CSV files from S3 using PySpark, and I want to add the filename as a new column. I am using the code below:

spark.udf.register("filenamefunc", lambda x: x.rsplit('/', 1)[-2])
df=spark.read.csv("s3a://exportcsv-battery/S5/243/101*",sep=',',header=True,inferSchema=True)
df=df.withColumn("filename", 'filenamefunc(input_file_name())')

Instead of the full filename, though, I want a substring of it. For example, if this is the input_file_name:

s3a://exportcsv-battery/S5/243/101_002932_243_AAA_A_T01_AAA_AAA_0_0_0_0_2_10Hz.csv

I want only 243 to be extracted and stored in a new column, so I defined a UDF:

spark.udf.register("filenamefunc", lambda x: x.rsplit('/', 1)[-2])

But it doesn't seem to work. Is there something I can do to fix it or a different approach? Thanks!

Kafels · Accepted Answer

You can use the built-in split() function instead of a UDF. Splitting the example path on '/' puts 243 at index 4:

import pyspark.sql.functions as f

[...]

df = df.withColumn('filename', f.split(f.input_file_name(), '/')[4])
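To see why index 4 picks out 243, and why the UDF's rsplit('/', 1)[-2] does not, here is a minimal plain-Python sketch that uses only the example path from the question. It runs without Spark. The variable names (path, head, tail, parts, folder, folder_from_end) are illustrative and not part of the original code.

```python
# Plain-Python illustration of the string logic only; no Spark session is needed.
path = "s3a://exportcsv-battery/S5/243/101_002932_243_AAA_A_T01_AAA_AAA_0_0_0_0_2_10Hz.csv"

# rsplit with maxsplit=1 cuts only at the LAST '/', so there are just two pieces.
# Index -2 is therefore the whole directory prefix, not the folder name alone.
head, tail = path.rsplit("/", 1)
print(head)  # s3a://exportcsv-battery/S5/243

# A full split on '/' yields one element per path segment.
# The "//" after the scheme produces an empty string at index 1.
parts = path.split("/")
print(parts[:5])  # ['s3a:', '', 'exportcsv-battery', 'S5', '243']

folder = parts[4]            # same position the answer indexes into
folder_from_end = parts[-2]  # parent folder, counted from the end
print(folder, folder_from_end)  # 243 243
```

Index 4 depends on how deep the files sit in the bucket. If that depth can vary, counting from the end is less brittle; on the Spark side, f.element_at(f.split(f.input_file_name(), '/'), -2) should express the same idea, assuming Spark 2.4 or later, where element_at accepts negative indices on arrays.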
