Reputation: 57
In PySpark data frame, I want to convert a string full file path into multiple rows of each parent path.
Input dataframe value:
ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt
Output : Every row should display an absolute path along with /
delimiter
ParentFolder/
ParentFolder/Folder1/
ParentFolder/Folder1/Folder2/
ParentFolder/Folder1/Folder2/Folder3/
ParentFolder/Folder1/Folder2/Folder3/Folder4/
ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt
Upvotes: 1
Views: 321
Reputation: 32640
You can split the column value
by /
delimiter to get all the parts of the path. Then using transform
function on the split result, you construct the different parent paths using slice
and array_join
functions:
from pyspark.sql import functions as F
df1 = df.withColumn("value", F.split(F.col("value"), "/")) \
.selectExpr("""
explode(
transform(value,
(x, i) -> struct(i+1 as rn, array_join(slice(value, 1, i+1), '/') ||
IF(i+1 < size(value), '/', '') as path)
)
) as paths
""").select("paths.*")
df1.show(truncate=False)
#+---+---------------------------------------------------------+
#|rn |path |
#+---+---------------------------------------------------------+
#|1 |ParentFolder/ |
#|2 |ParentFolder/Folder1/ |
#|3 |ParentFolder/Folder1/Folder2/ |
#|4 |ParentFolder/Folder1/Folder2/Folder3/ |
#|5 |ParentFolder/Folder1/Folder2/Folder3/Folder4/ |
#|6 |ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt|
#+---+---------------------------------------------------------+
For Spark < 2.4, you can use UDF like this:
import os
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
def get_all_paths(path: str):
paths = [path]
for _ in range(path.count("/")):
path, base = os.path.split(path)
paths.append(path + "/")
return list(reversed(paths))
decompose_path = F.udf(get_all_paths, ArrayType(StringType()))
df1 = df.select(F.explode(decompose_path(F.col("value"))).alias("paths"))
Upvotes: 0
Reputation: 42332
You can use substring_index
as below:
df2 = df.selectExpr("""
explode(
transform(
sequence(1, size(split(col, '/'))),
(x, i) -> case when i = size(split(col, '/')) - 1
then col
else substring_index(col, '/', x) || '/'
end
)
) as col
""")
df2.show(20,0)
+---------------------------------------------------------+
|col |
+---------------------------------------------------------+
|ParentFolder/ |
|ParentFolder/Folder1/ |
|ParentFolder/Folder1/Folder2/ |
|ParentFolder/Folder1/Folder2/Folder3/ |
|ParentFolder/Folder1/Folder2/Folder3/Folder4/ |
|ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt|
+---------------------------------------------------------+
Upvotes: 1