Kiwy

Reputation: 346

Spark: udf to get dirname from path

I have a huge number of paths that I need to split into 2 columns, basename and dirname. I can easily get the basename of my paths using:

import org.apache.spark.sql.functions.substring_index

val df = Seq("/test/coucou/jambon/hello/file",
    "/test/jambon/test")
  .toDF("column1")

df.withColumn("basename", substring_index($"column1", "/", -1))
  .show(2, false)
+------------------------------+---------+
|column1                       |basename |
+------------------------------+---------+
|/test/coucou/jambon/hello/file|file     |
|/test/jambon/test             |test     |
+------------------------------+---------+

However, I'm struggling to get the dirname, like this:

+------------------------------+--------------------------+
|column1                       |dirname                   |
+------------------------------+--------------------------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello |
|/test/jambon/test             |/test/jambon              |
+------------------------------+--------------------------+

I have tried various solutions, but I'm unable to find a working columnar one.
My best idea would be to subtract $"basename" from $"column1"; however, I couldn't find a way to subtract one String column from another in Spark.
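
One way to sketch that idea as a pure column operation, assuming the paths never end with a trailing slash, would be to strip the final segment with regexp_replace, though I'm not sure it's the right approach:

import org.apache.spark.sql.functions.regexp_replace

// Sketch: dropping the last "/segment" leaves the dirname.
df.withColumn("dirname", regexp_replace($"column1", "/[^/]+$", ""))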

Upvotes: 2

Views: 521

Answers (3)

Balaji Reddy

Reputation: 5700

An alternative to the solutions already provided, using a regular expression.

Get your regular expression right and the built-in regexp_extract function will give you what you want.

val df = Seq("/test/coucou/jambon/hello/file",
    "/test/jambon/prout/test")
  .toDF("column1")

import org.apache.spark.sql.functions.regexp_extract

df.withColumn("path", regexp_extract('column1, "^\\/(\\w+\\/)+", 0))
  .withColumn("fileName", regexp_extract('column1, "\\w+$", 0))
  .show(false)

Output

+------------------------------+--------------------------+--------+
|column1                       |path                      |fileName|
+------------------------------+--------------------------+--------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello/|file    |
|/test/jambon/prout/test       |/test/jambon/prout/       |test    |
+------------------------------+--------------------------+--------+

Edit:
To get the path without the trailing slash, this regex is easier to manage:

df.withColumn("path", regexp_extract($"column1", "^(.+)(/.+)$", 1))
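
A small variation on that regex, moving the slash out of the second capture group, can produce both columns from a single pattern; this is just a sketch, assuming no trailing slash:

// Sketch: group 1 is the dirname, group 2 the basename.
df.withColumn("path", regexp_extract($"column1", "^(.+)/([^/]+)$", 1))
  .withColumn("fileName", regexp_extract($"column1", "^(.+)/([^/]+)$", 2))
  .show(false)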

Upvotes: 3

effemm

Reputation: 151

Another approach would be to use UDFs:

import org.apache.spark.sql.functions.{substring_index, udf}

// UDF that keeps everything before the last "/"
val pathUDF = udf((s: String) => s.substring(0, s.lastIndexOf("/")))

val test = df.withColumn("basename", substring_index($"column1", "/", -1))
  .withColumn("path", pathUDF($"column1"))

test.show(false)
+------------------------------+--------+-------------------------+
|column1                       |basename|path                     |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file    |/test/coucou/jambon/hello|
|/test/jambon/prout/test       |test    |/test/jambon/prout       |
+------------------------------+--------+-------------------------+
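
One caveat with this UDF, beyond what the answer states: it throws a NullPointerException if column1 contains nulls. A null-safe sketch could return an Option, which Spark maps to null (safePathUDF is just an illustrative name):

// Sketch: Option handles null input; the filter guards against paths with no "/".
val safePathUDF = udf((s: String) =>
  Option(s).filter(_.contains("/")).map(p => p.substring(0, p.lastIndexOf("/"))))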

Upvotes: 2

Apurba Pandey

Reputation: 1076

You can use expr to take a substring of column1. The code should look like below. I hope it's helpful.

import org.apache.spark.sql.functions.{expr, substring_index}

// Creating test data
val df = Seq("/test/coucou/jambon/hello/file",
    "/test/jambon/prout/test")
  .toDF("column1")

val test = df.withColumn("basename", substring_index($"column1", "/", -1))
  .withColumn("path", expr("substring(column1, 1, length(column1) - length(basename) - 1)"))

test.show(false)
+------------------------------+--------+-------------------------+
|column1                       |basename|path                     |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file    |/test/coucou/jambon/hello|
|/test/jambon/prout/test       |test    |/test/jambon/prout       |
+------------------------------+--------+-------------------------+
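
The same idea can also be written entirely as SQL expressions with selectExpr; this is just a sketch of an equivalent formulation:

df.selectExpr(
    "column1",
    "substring_index(column1, '/', -1) as basename",
    "substring(column1, 1, length(column1) - length(substring_index(column1, '/', -1)) - 1) as path")
  .show(false)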

Upvotes: 3
