Exporting PySpark Dataframe to Azure Data Lake Takes Forever

Question

The code below ran perfectly well on the standalone version of PySpark 2.4 on Mac OS (Python 3.7) when the size of the input data (around 6 GB) was small. However, when I ran the code on HDInsight cluster (HDI 4.0, i.e. Python 3.5, PySpark 2.4, 4 worker nodes and each has 64 cores and 432 GB of RAM, 2 header nodes and each has 4 cores and 28 GB of RAM, 2nd generation of data lake) with larger input data (169 GB), the last step, which is, writing data to the data lake, took forever (I killed it after 24 hours of execution) to complete. Given the fact that HDInsight is not popular in the cloud computing community, I could only reference posts that complained about the low speed when writing dataframe to S3. Some suggested to repartition the dataset, which I did, but it did not help.

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, IntegerType, BooleanType
from pyspark.sql.functions import udf, regexp_extract, collect_set, array_remove, col, size, asc, desc
from pyspark.ml.fpm import FPGrowth
import os
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.5"

def work(order_path, beer_path, corpus_path, output_path, FREQ_THRESHOLD=1000, LIFT_THRESHOLD=1):
    print("Creating Spark Environment...")
    spark = SparkSession.builder.appName("Menu").getOrCreate()
    print("Spark Environment Created!")
    print("Working on Checkpoint1...")
    orders = spark.read.csv(order_path)
    orders.createOrReplaceTempView("orders")
    orders = spark.sql(
        "SELECT _c14 AS order_id, _c31 AS in_menu_id, _c32 AS in_menu_name FROM orders"
    )
    orders.createOrReplaceTempView("orders")
    beer = spark.read.csv(
        beer_path,
        header=True
    )
    beer.createOrReplaceTempView("beer")
    beer = spark.sql(
        """
        SELECT 
            order_id AS beer_order_id,
            in_menu_id AS beer_in_menu_id,
            '-999' AS beer_in_menu_name
        FROM beer
        """
    )
    beer.createOrReplaceTempView("beer")
    orders = spark.sql(
        """
        WITH orders_beer AS (
            SELECT *
            FROM orders
            LEFT JOIN beer
            ON orders.in_menu_id = beer.beer_in_menu_id
        )
        SELECT
            order_id,
            in_menu_id,
            CASE
                WHEN beer_in_menu_name IS NOT NULL THEN beer_in_menu_name
                WHEN beer_in_menu_name IS NULL THEN in_menu_name
            END AS menu_name
        FROM orders_beer
        """
    )
    print("Checkpoint1 Completed!")
    print("Working on Checkpoint2...")
    corpus = spark.read.csv(
        corpus_path,
        header=True
    )
    keywords = corpus.select("Food_Name").rdd.flatMap(lambda x: x).collect()
    orders = orders.withColumn(
        "keyword", 
        regexp_extract(
            "menu_name", 
            "(?=^|\s)(" + "|".join(keywords) + ")(?=\s|$)", 
            0
        )
    )
    orders.createOrReplaceTempView("orders")
    orders = spark.sql("""
        SELECT order_id, in_menu_id, keyword
        FROM orders
        WHERE keyword != ''
    """)
    orders.createOrReplaceTempView("orders")
    orders = orders.groupBy("order_id").agg(
        collect_set("keyword").alias("items")
    )
    print("Checkpoint2 Completed!")
    print("Working on Checkpoint3...")
    fpGrowth = FPGrowth(
        itemsCol="items", 
        minSupport=0, 
        minConfidence=0
    )
    model = fpGrowth.fit(orders)
    print("Checkpoint3 Completed!")
    print("Working on Checkpoint4...")
    frequency = model.freqItemsets
    frequency = frequency.filter(col("freq") > FREQ_THRESHOLD)
    frequency = frequency.withColumn(
        "items", 
        array_remove("items", "-999")
    )
    frequency = frequency.filter(size(col("items")) > 0)
    frequency = frequency.orderBy(asc("items"), desc("freq"))
    frequency = frequency.dropDuplicates(["items"])
    frequency = frequency.withColumn(
        "antecedent", 
        udf(
            lambda x: "|".join(sorted(x)), StringType()
        )(frequency.items)
    )
    frequency.createOrReplaceTempView("frequency")
    lift = model.associationRules
    lift = lift.drop("confidence")
    lift = lift.filter(col("lift") > LIFT_THRESHOLD)
    lift = lift.filter(
        udf(
            lambda x: x == ["-999"], BooleanType()
        )(lift.consequent)
    )
    lift = lift.drop("consequent")
    lift = lift.withColumn(
        "antecedent", 
        udf(
            lambda x: "|".join(sorted(x)), StringType()
        )(lift.antecedent)
    )
    lift.createOrReplaceTempView("lift")
    result = spark.sql(
        """
        SELECT lift.antecedent, freq AS frequency, lift
        FROM lift
        INNER JOIN frequency
        ON lift.antecedent = frequency.antecedent
        """
    )
    print("Checkpoint4 Completed!")
    print("Writing Result to Data Lake...")
    result.repartition(1024).write.mode("overwrite").parquet(output_path)
    print("All Done!")

def main():
    work(
        order_path=169.1 GB of txt,
        beer_path=4.9 GB of csv,
        corpus_path=210 KB of csv,
        output_path="final_result.parquet"
    )

if __name__ == "__main__":
    main()

I first thought this was caused by the file format parquet. However, when I tried csv, I met with the same problem. I tried result.count() to see how many rows the table result has. It took forever to get the row number, just like writing the data to the data lake. There was a suggestion to use broadcast hash join instead of the default sort-merge join if a large dataset is joined with a small dataset. I thought it was worth trying because the smaller samples in the pilot study told me the row number of frequency is roughly 0.09% of that of lift (See the query below if you have difficulty tracking frequency and lift).

SELECT lift.antecedent, freq AS frequency, lift
FROM lift
INNER JOIN frequency
ON lift.antecedent = frequency.antecedent

With that in mind, I revised my code:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, IntegerType, BooleanType
from pyspark.sql.functions import udf, regexp_extract, collect_set, array_remove, col, size, asc, desc
from pyspark.ml.fpm import FPGrowth
import os
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.5"

def work(order_path, beer_path, corpus_path, output_path, FREQ_THRESHOLD=1000, LIFT_THRESHOLD=1):
    print("Creating Spark Environment...")
    spark = SparkSession.builder.appName("Menu").getOrCreate()
    print("Spark Environment Created!")
    print("Working on Checkpoint1...")
    orders = spark.read.csv(order_path)
    orders.createOrReplaceTempView("orders")
    orders = spark.sql(
        "SELECT _c14 AS order_id, _c31 AS in_menu_id, _c32 AS in_menu_name FROM orders"
    )
    orders.createOrReplaceTempView("orders")
    beer = spark.read.csv(
        beer_path,
        header=True
    )
    beer.createOrReplaceTempView("beer")
    beer = spark.sql(
        """
        SELECT 
            order_id AS beer_order_id,
            in_menu_id AS beer_in_menu_id,
            '-999' AS beer_in_menu_name
        FROM beer
        """
    )
    beer.createOrReplaceTempView("beer")
    orders = spark.sql(
        """
        WITH orders_beer AS (
            SELECT *
            FROM orders
            LEFT JOIN beer
            ON orders.in_menu_id = beer.beer_in_menu_id
        )
        SELECT
            order_id,
            in_menu_id,
            CASE
                WHEN beer_in_menu_name IS NOT NULL THEN beer_in_menu_name
                WHEN beer_in_menu_name IS NULL THEN in_menu_name
            END AS menu_name
        FROM orders_beer
        """
    )
    print("Checkpoint1 Completed!")
    print("Working on Checkpoint2...")
    corpus = spark.read.csv(
        corpus_path,
        header=True
    )
    keywords = corpus.select("Food_Name").rdd.flatMap(lambda x: x).collect()
    orders = orders.withColumn(
        "keyword", 
        regexp_extract(
            "menu_name", 
            "(?=^|\s)(" + "|".join(keywords) + ")(?=\s|$)", 
            0
        )
    )
    orders.createOrReplaceTempView("orders")
    orders = spark.sql("""
        SELECT order_id, in_menu_id, keyword
        FROM orders
        WHERE keyword != ''
    """)
    orders.createOrReplaceTempView("orders")
    orders = orders.groupBy("order_id").agg(
        collect_set("keyword").alias("items")
    )
    print("Checkpoint2 Completed!")
    print("Working on Checkpoint3...")
    fpGrowth = FPGrowth(
        itemsCol="items", 
        minSupport=0, 
        minConfidence=0
    )
    model = fpGrowth.fit(orders)
    print("Checkpoint3 Completed!")
    print("Working on Checkpoint4...")
    frequency = model.freqItemsets
    frequency = frequency.filter(col("freq") > FREQ_THRESHOLD)
    frequency = frequency.withColumn(
        "antecedent", 
        array_remove("items", "-999")
    )
    frequency = frequency.drop("items")
    frequency = frequency.filter(size(col("antecedent")) > 0)
    frequency = frequency.orderBy(asc("antecedent"), desc("freq"))
    frequency = frequency.dropDuplicates(["antecedent"])
    frequency = frequency.withColumn(
        "antecedent", 
        udf(
            lambda x: "|".join(sorted(x)), StringType()
        )(frequency.antecedent)
    )
    lift = model.associationRules
    lift = lift.drop("confidence")
    lift = lift.filter(col("lift") > LIFT_THRESHOLD)
    lift = lift.filter(
        udf(
            lambda x: x == ["-999"], BooleanType()
        )(lift.consequent)
    )
    lift = lift.drop("consequent")
    lift = lift.withColumn(
        "antecedent", 
        udf(
            lambda x: "|".join(sorted(x)), StringType()
        )(lift.antecedent)
    )
    result = lift.join(
        frequency.hint("broadcast"), 
        ["antecedent"], 
        "inner"
    )
    print("Checkpoint4 Completed!")
    print("Writing Result to Data Lake...")
    result.repartition(1024).write.mode("overwrite").parquet(output_path)
    print("All Done!")

def main():
    work(
        order_path=169.1 GB of txt,
        beer_path=4.9 GB of csv,
        corpus_path=210 KB of csv,
        output_path="final_result.parquet"
    )

if __name__ == "__main__":
    main()

The code ran perfectly well with the same sample data on my Mac OS and as expected took less time (34 seconds vs. 26 seconds). Then I decided to run the code to HDInsight with full datasets. In the last step, which is writing data to the data lake, the task failed and I was told Job cancelled because SparkContext was shut down. I am rather new to big data and have no idea with this means. Posts on the internet said there could be many reasons behind it. Whatever the method I should use, how to optimize my code so I can get the desired output in the data lake within bearable amount of time?

James Chang · Accepted Answer

I have figured it out after five days' struggle. Here are the approaches that I took to optimize the code. The time of code execution dropped from more than 24 hours to around 10 minutes. Code optimization is really really important.

As David Taub below pointed out, use df.cache() after heavy computation or before feeding the data to the model. I used df.cache().count() since calling .cache() on its own is lazily evaluated but the following .count() forces an evaluation of the entire dataset.
Use flashtext instead of regular expression to extract keywords. This greatly improves code performance.
Be careful with joins / merge. It might get extremely slow due to data skewness. Always think about ways to avoid unnecessary joins.
Set minSupport for FPGrowth. This significantly reduces the time when calling model.freqItemsets.

Exporting PySpark Dataframe to Azure Data Lake Takes Forever

Answers (2)

Related Questions