Dyex719

Reputation: 19

Pyspark FP growth implementation running slow

I am using the pyspark.ml.fpm (FP Growth) implementation of association rule mining on Spark v2.3.

The Spark UI shows that the tasks at the end run very slowly. This seems to be a common problem and might be related to data skew.

Is this the real reason? Is there any solution for this?

I don't want to change the minSupport or minConfidence thresholds because that would affect my results. Removing columns isn't a solution either.
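
For reference, the setup is roughly the following (the column name, toy data, and threshold values are placeholders, not my real ones):

from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

# Toy transactions: one row per basket, items in an array column
transactions = spark.createDataFrame(
    [(0, ["a", "b"]), (1, ["a", "b", "c"]), (2, ["a"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(transactions)
model.freqItemsets.show()
model.associationRules.show()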

Upvotes: 0

Views: 1209

Answers (2)

maz

Reputation: 11

Late answer, but I also had an issue with long FPGrowth wait times, and Jasper's answer below really helped. I implemented it as follows, filtering out any transaction whose size is more than one standard deviation above the mean (this runs after the transactions have been grouped):

from pyspark.sql.functions import col, size, mean as _mean, stddev as _stddev

def clean_transactions(df):
    # Add a column with the number of items in each basket
    transactions_init = df.withColumn("basket_size", size("basket"))
    print('---collecting stats')
    # Compute the mean and standard deviation of the basket sizes
    df_stats = transactions_init.select(
        _mean(col('basket_size')).alias('mean'),
        _stddev(col('basket_size')).alias('std')
    ).collect()
    mean = df_stats[0]['mean']
    std = df_stats[0]['std']
    # Keep only transactions at most one standard deviation above the mean size
    max_ct = mean + std
    print("--filtering out outliers")
    transactions_cleaned = transactions_init.filter(transactions_init.basket_size <= max_ct)
    return transactions_cleaned
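
The cleaned DataFrame can then be fed straight into FPGrowth. A minimal usage sketch, assuming transactions is your grouped DataFrame with a "basket" array column (the threshold values here are made up, not taken from the question):

from pyspark.ml.fpm import FPGrowth

# Clean the grouped transactions, then fit FP Growth on the result
transactions_cleaned = clean_transactions(transactions)
fp = FPGrowth(itemsCol="basket", minSupport=0.01, minConfidence=0.5)
model = fp.fit(transactions_cleaned)
model.associationRules.show()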

Upvotes: 1

Jasper

Reputation: 101

I was facing a similar issue. One solution you might try is setting a threshold on the number of products in a transaction. If a few transactions have far more products than the average, the tree computed by FP Growth blows up. This makes the runtime increase significantly and greatly raises the risk of memory errors.

Hence, removing outlier transactions with a disproportionate number of products might do the trick; see the sketch below.
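
As a sketch, the same idea with a hypothetical hard cap (the "items" column name and the cap of 50 products are made-up values, not from my data):

from pyspark.sql.functions import size

# Drop any transaction with more than 50 products before running FP Growth
transactions_capped = transactions.filter(size("items") <= 50)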

Hope this helps you out a bit :)

Upvotes: 1
