Jay Cee

Reputation: 1965

pyspark parse filename on load

I'm quite new to spark and there is one thing that I don't understand: how to manipulate column content.

I have a set of CSV files as follows:

(screenshot: directory listing of the CSV files)

each dsX is a table and I would like to load the data at once for each table.

So far no problems:

df = spark.read.format('csv') \
            .option("header", "true") \
            .option("escape", "\"") \
            .load(table+"/*")

But one piece of information is missing: the client_id, which is the first part of the CSV file name: clientId_table_category.csv

So I tried to do this:

def extract_path(patht):
    print(patht)
    return patht

df = spark.read.format('csv') \
            .option("header", "true") \
            .option("escape", "\"") \
            .load(table+"/*") \
            .withColumn("clientId", fn.lit(extract_path(fn.input_file_name())))

But the print returns: Column<b'input_file_name()'> — so my function receives a Column object, not the actual path string.

And I can't do much with this.

I'm quite stuck here, how do you manipulate data in this configuration?

Another solution would be to load each CSV one by one and parse the clientId from the file name manually, but I was wondering if there isn't a more powerful solution with Spark.

Upvotes: 0

Views: 1179

Answers (1)

Steven

Reputation: 15258

You are overcomplicating it:

df = spark.read.csv(
    table + "/*",
    header=True,
    escape='"'
).withColumn("clientId", fn.input_file_name())

This creates a column containing the full file path. Then you just need some extra string manipulation, which is easy with a UDF. You can also do it with built-in functions, but it is trickier.

from pyspark.sql import functions as fn
from pyspark.sql.types import StringType

@fn.udf(StringType())
def get_id(in_string):
    # keep only the file name, then take the part before the first "_"
    return in_string.split("/")[-1].split("_")[0]


df = df.withColumn(
    "clientId",
    get_id(fn.col("clientId"))
)
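The built-in alternative would be `fn.regexp_extract` applied to the path column instead of the UDF. Here is a minimal sketch of the extraction logic in plain Python, assuming paths shaped like `.../clientId_table_category.csv` (the bucket and file names below are hypothetical):

```python
import re

# Pattern to use with fn.regexp_extract: capture everything before the
# first underscore in the last path segment.
CLIENT_ID_PATTERN = r"([^/_]+)_[^/]*$"

path = "s3://bucket/ds1/42_ds1_sales.csv"  # hypothetical file path
client_id = re.search(CLIENT_ID_PATTERN, path).group(1)
print(client_id)  # -> 42
```

Under the same assumption, the Spark version would be `df.withColumn("clientId", fn.regexp_extract(fn.input_file_name(), r"([^/_]+)_[^/]*$", 1))`, with no UDF needed.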

Upvotes: 1
