Reputation: 346
I have a huge number of paths that I need to split into two columns, basename and dirname. I know how to easily get the basename of my paths using:
val df = Seq("/test/coucou/jambon/hello/file"
,"/test/jambon/test")
.toDF("column1")
df.withColumn("basename", substring_index($"column1" , "/", -1))
.show(2, false)
+------------------------------+---------+
|column1 |basename |
+------------------------------+---------+
|/test/coucou/jambon/hello/file|file |
|/test/jambon/test |test |
+------------------------------+---------+
However, I'm struggling to get the dirname, like this:
+------------------------------+--------------------------+
|column1 |dirname |
+------------------------------+--------------------------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello |
|/test/jambon/test |/test/jambon |
+------------------------------+--------------------------+
I have tried various solutions, but I'm unable to find a functional columnar one.
My best idea would be to subtract $"basename" from $"column1", however I couldn't find a way to subtract Strings in Spark.
Upvotes: 2
Views: 521
Reputation: 5700
An alternative approach to the solutions already provided, using a regular expression.
Get your regular expression right, and the built-in regexp_extract function will give you what you want.
val df = Seq("/test/coucou/jambon/hello/file"
, "/test/jambon/prout/test")
.toDF("column1")
import org.apache.spark.sql.functions.regexp_extract
df.withColumn("path", regexp_extract('column1, "^\\/(\\w+\\/)+", 0))
  .withColumn("fileName", regexp_extract('column1, "\\w+$", 0))
  .show(false)
Output
+------------------------------+--------------------------+--------+
|column1 |path |fileName|
+------------------------------+--------------------------+--------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello/|file |
|/test/jambon/prout/test |/test/jambon/prout/ |test |
+------------------------------+--------------------------+--------+
Edit:
Without the trailing slash, the regex is easier to manage:
df.withColumn("path", regexp_extract($"column1", "^(.+)(/.+)$", 1))
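To see what each capture group of that pattern holds, here is a plain-Scala sketch of the same regex (not Spark code, just the standard-library Regex extractor on a sample path):

```scala
// "^(.+)(/.+)$": the greedy group 1 grabs everything up to the last "/",
// group 2 keeps the final "/" plus the basename.
val pathRe = "^(.+)(/.+)$".r

"/test/coucou/jambon/hello/file" match {
  case pathRe(dir, last) =>
    println(dir)   // "/test/coucou/jambon/hello" -- dirname, no trailing slash
    println(last)  // "/file" -- "/" plus the basename
}
```

So passing group index 1 to regexp_extract yields the dirname without the trailing slash.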
Upvotes: 3
Reputation: 151
Another approach would be to use UDFs:
import org.apache.spark.sql.functions.udf
val pathUDF = udf((s: String) => s.substring(0, s.lastIndexOf("/"))) // everything before the last "/"
val test = df.withColumn("basename", substring_index($"column1" , "/", -1))
.withColumn("path", pathUDF($"column1"))
test.show(false)
+------------------------------+--------+-------------------------+
|column1 |basename|path |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file |/test/coucou/jambon/hello|
|/test/jambon/prout/test |test |/test/jambon/prout |
+------------------------------+--------+-------------------------+
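One caveat worth noting: when the input contains no "/" at all, lastIndexOf returns -1 and substring(0, -1) throws a StringIndexOutOfBoundsException. A defensive variant of the same logic, sketched in plain Scala (returning "" for slash-less inputs is just one possible convention):

```scala
// Safer dirname: guard against paths with no "/" before calling substring.
def dirname(s: String): String = {
  val i = s.lastIndexOf("/")
  if (i < 0) "" else s.substring(0, i) // "" when there is no separator
}
```

It can be wrapped the same way as above, e.g. `val pathUDF = udf(dirname _)`.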
Upvotes: 2
Reputation: 1076
You can use expr to substring column1. The code should look like below. I hope it's helpful.
//Creating Test Data
val df = Seq("/test/coucou/jambon/hello/file"
,"/test/jambon/prout/test")
.toDF("column1")
val test = df.withColumn("basename", substring_index($"column1" , "/", -1))
.withColumn("path", expr("substring(column1, 1, length(column1)-length(basename)-1)"))
test.show(false)
+------------------------------+--------+-------------------------+
|column1 |basename|path |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file |/test/coucou/jambon/hello|
|/test/jambon/prout/test |test |/test/jambon/prout |
+------------------------------+--------+-------------------------+
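The arithmetic inside that expr is simply: keep the first `length(column1) - length(basename) - 1` characters, the extra `- 1` dropping the "/" that separates dirname from basename. The same computation in plain Scala, on one sample value:

```scala
// Mirror of the expr: dirname = path minus basename minus the separating "/".
val path = "/test/coucou/jambon/hello/file"
val base = path.split("/").last                       // "file"
val dir  = path.substring(0, path.length - base.length - 1)
println(dir)  // "/test/coucou/jambon/hello"
```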
Upvotes: 3