GeekSQL

Reputation: 57

Convert a full file path into multiple rows of parent absolute paths in PySpark

In a PySpark dataframe, I want to convert a full file path string into multiple rows, one for each parent path.

Input dataframe value:

ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt

Output: every row should show the absolute path at that level, including the trailing / delimiter

ParentFolder/
ParentFolder/Folder1/
ParentFolder/Folder1/Folder2/
ParentFolder/Folder1/Folder2/Folder3/
ParentFolder/Folder1/Folder2/Folder3/Folder4/
ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt

Upvotes: 1

Views: 321

Answers (2)

blackbishop

Reputation: 32640

You can split the column value by the / delimiter to get all the parts of the path. Then, using the transform function on the split result, you construct the different parent paths with the slice and array_join functions, as shown below.
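For reference, a minimal sketch of the input dataframe assumed by the code below (a single string column named value, matching the column the code references):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt",)],
    ["value"],
)

With that df in place: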

from pyspark.sql import functions as F

# split the path into an array of components, then for each component
# build the prefix path: slice takes the first i+1 parts (i is the
# 0-based index in transform), array_join glues them back with '/',
# and a trailing '/' is appended for every row except the full path
df1 = df.withColumn("value", F.split(F.col("value"), "/")) \
    .selectExpr("""
      explode(
           transform(value,
                     (x, i) -> struct(i+1 as rn, array_join(slice(value, 1, i+1), '/') ||
                                      IF(i+1 < size(value), '/', '') as path)
                     )
       ) as paths
    """).select("paths.*")


df1.show(truncate=False)

#+---+---------------------------------------------------------+
#|rn |path                                                     |
#+---+---------------------------------------------------------+
#|1  |ParentFolder/                                            |
#|2  |ParentFolder/Folder1/                                    |
#|3  |ParentFolder/Folder1/Folder2/                            |
#|4  |ParentFolder/Folder1/Folder2/Folder3/                    |
#|5  |ParentFolder/Folder1/Folder2/Folder3/Folder4/            |
#|6  |ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt|
#+---+---------------------------------------------------------+

For Spark < 2.4, you can use a UDF like this:

import os
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType


def get_all_paths(path: str):
    # start from the full path, then repeatedly strip the last component,
    # collecting each parent with a trailing '/'
    paths = [path]
    for _ in range(path.count("/")):
        path, _base = os.path.split(path)
        paths.append(path + "/")

    return list(reversed(paths))


decompose_path = F.udf(get_all_paths, ArrayType(StringType()))

df1 = df.select(F.explode(decompose_path(F.col("value"))).alias("paths"))
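Since get_all_paths is plain Python, you can sanity-check it locally before wrapping it in a UDF:

print(get_all_paths("ParentFolder/Folder1/Folder2/TestFile.txt"))
# ['ParentFolder/', 'ParentFolder/Folder1/',
#  'ParentFolder/Folder1/Folder2/',
#  'ParentFolder/Folder1/Folder2/TestFile.txt']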

Upvotes: 0

mck

Reputation: 42332

You can use substring_index as shown below:

# generate indices 1..N with sequence (N = number of path components);
# for each x, substring_index(col, '/', x) keeps the first x components,
# and the last element (0-based index i = N-1) returns the full path
# without appending a trailing '/'
df2 = df.selectExpr("""
    explode(
        transform(
            sequence(1, size(split(col, '/'))),
            (x, i) -> case when i = size(split(col, '/')) - 1
                           then col
                           else substring_index(col, '/', x) || '/'
                           end
        )
    ) as col
""")

df2.show(20, 0)
+---------------------------------------------------------+
|col                                                      |
+---------------------------------------------------------+
|ParentFolder/                                            |
|ParentFolder/Folder1/                                    |
|ParentFolder/Folder1/Folder2/                            |
|ParentFolder/Folder1/Folder2/Folder3/                    |
|ParentFolder/Folder1/Folder2/Folder3/Folder4/            |
|ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt|
+---------------------------------------------------------+
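As a side note, substring_index(str, delim, count) returns everything before the count-th occurrence of delim, which is what builds each prefix above. A quick check of that behavior (assuming an active spark session):

spark.sql(
    "SELECT substring_index('ParentFolder/Folder1/Folder2', '/', 2) AS prefix"
).show(truncate=False)

+--------------------+
|prefix              |
+--------------------+
|ParentFolder/Folder1|
+--------------------+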

Upvotes: 1
