Jay Cee

Reputation: 1965

pyspark parse filename on load

I'm quite new to spark and there is one thing that I don't understand: how to manipulate column content.

I have a set of CSV files as follows:

(screenshot: directory listing of the CSV files)

each dsX is a table and I would like to load the data at once for each table.

So far no problems:

df = spark.read.format('csv') \
            .option("header", "true") \
            .option("escape", "\"") \
            .load(table+"/*")

But one piece of information is missing: the client_id, which is the first part of the CSV file name: clientId_table_category.csv

So I tried to do this:

def extract_path(patht):
    print(patht)
    return patht

df = spark.read.format('csv') \
            .option("header", "true") \
            .option("escape", "\"") \
            .load(table+"/*") \
            .withColumn("clientId", fn.lit(extract_path(fn.input_file_name())))

But the print returns: Column<b'input_file_name()'> — so my function receives a Column object, not the actual path string.

And I can't do much with this.

I'm quite stuck here, how do you manipulate data in this configuration?

Another solution would be to load each CSV one by one and parse the clientId from the file name manually, but I was wondering if there isn't a more powerful solution with Spark.

Upvotes: 0

Views: 1179

Answers (1)

Steven

Reputation: 15258

You are overcomplicating it:

df = spark.read.csv(
    table + "/*",
    header=True,
    escape='"'
).withColumn("clientId", fn.input_file_name())

This creates a column containing the full file path. Then you just need some extra string manipulation, which is easy with a UDF. You can also do it with built-in functions, but it is trickier.

from pyspark.sql import functions as fn
from pyspark.sql.types import StringType

@fn.udf(StringType())
def get_id(in_string):
    # keep only the file name, then take the part before the first "_"
    return in_string.split("/")[-1].split("_")[0]


df = df.withColumn(
    "clientId",
    get_id(fn.col("clientId"))
)
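The built-in alternative would be `fn.regexp_extract` applied to the path column instead of the UDF. Here is a minimal sketch of the extraction logic in plain Python, assuming paths shaped like `.../clientId_table_category.csv` (the bucket and file names below are hypothetical):

```python
import re

# Pattern to use with fn.regexp_extract: capture everything before the
# first underscore in the last path segment.
CLIENT_ID_PATTERN = r"([^/_]+)_[^/]*$"

path = "s3://bucket/ds1/42_ds1_sales.csv"  # hypothetical file path
client_id = re.search(CLIENT_ID_PATTERN, path).group(1)
print(client_id)  # -> 42
```

Under the same assumption, the Spark version would be `df.withColumn("clientId", fn.regexp_extract(fn.input_file_name(), r"([^/_]+)_[^/]*$", 1))`, with no UDF needed.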

Upvotes: 1
