Mark McGown

Reputation: 1115

How to implement FPGrowth algorithm in Python?

I've successfully used the apriori algorithm in Python as follows:

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

df = pd.read_csv('C:\\Users\\marka\\Downloads\\Assig5.csv')
df = apriori(df, min_support=0.79, use_colnames=True)
rules = association_rules(df, metric="lift", min_threshold=1)
rules[ (rules['lift'] >= 1) &
       (rules['confidence'] >= 1) ]
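For context, mlxtend's apriori operates on a one-hot encoded DataFrame: one row per transaction and one boolean column per item. A minimal pure-Python sketch of that shape, using made-up transactions and item names:

```python
# Hypothetical transactions, just to illustrate the one-hot layout
# that mlxtend's apriori expects.
transactions = [
    ["Rock_salt", "Water", "Blankets"],
    ["Water", "Canned_food"],
    ["Rock_salt", "Water", "Canned_food"],
]

# Collect the set of all items to use as column names.
items = sorted({item for basket in transactions for item in basket})

# Build one one-hot row per transaction: True where the item is present.
rows = [{item: (item in basket) for item in items} for basket in transactions]

for row in rows:
    print(row)
```

Wrapping `rows` in `pd.DataFrame(rows)` gives the kind of boolean matrix that `apriori(df, ...)` consumes.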

I'd like to use the FPGrowth algorithm to see if I get the same result, but I believe I'm using it wrong, since I don't get similar output. The Spark documentation (https://spark.apache.org/docs/1.6.0/mllib-frequent-pattern-mining.html) gives this example:

from pyspark.mllib.fpm import FPGrowth
data = sc.textFile("data/mllib/sample_fpgrowth.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

My code, in turn, is:

from pyspark.mllib.fpm import FPGrowth
from pyspark import SparkConf
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
data = sc.textFile("C:\\Users\\marka\\Downloads\\Assig6.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

But instead of a real answer I get the following. What am I doing wrong?

FreqItemset(items=['1\t1\t1\t1\t1\t1\t1\t0\t0\t0\t0\t1\t1\t0\t0\t1\t1\t1\t1\t1\t0\t0'], freq=24)

To make Assig6 I just resaved my original CSV as a TXT file.

I started changing my format and updated my code per user10136092, but I still get the undesired output. Here are my code, its output, and a sample picture of my new input.

from pyspark.mllib.fpm import FPGrowth
from pyspark import SparkConf
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
data = sc.textFile("C:\\Users\\marka\\Downloads\\Assig2.txt")
data.map(lambda line: line.strip().split())
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

Output

FreqItemset(items=['Rock_salt\tFlashlights\tWater\tSnow_shovels\tBlankets\tCanned_food'], freq=34)


Upvotes: 1

Views: 6210

Answers (2)

Litan Ilany

Reputation: 163

I think the file is tab-separated, so you should split it on '\t' instead of ' ':

transactions = data.map(lambda line: line.strip().split('\t'))
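A quick way to see the difference, using a made-up line in the same shape as the output above (items joined by tabs):

```python
# A hypothetical tab-separated line, like the ones producing the odd output.
line = "Rock_salt\tFlashlights\tWater\n"

# Splitting on a space finds no spaces, so the whole line stays one "item" ...
print(line.strip().split(' '))   # ['Rock_salt\tFlashlights\tWater']

# ... while splitting on a tab yields one item per column.
print(line.strip().split('\t'))  # ['Rock_salt', 'Flashlights', 'Water']
```

That single tab-joined "item" is exactly what shows up inside the FreqItemset in the question.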

Upvotes: 0

user10136092

Reputation: 26

Your data is not valid input for Spark's FPGrowth algorithm.

In Spark each basket should be represented as a list of unique labels, for example:

baskets = sc.parallelize([("Rock Salt", "Blankets"), ("Blankets", "Dry Fruits", "Canned Food")])

not a binary matrix as in the other library you use. Please convert your data to the aforementioned format first.
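A minimal pure-Python sketch of that conversion, assuming the binary matrix comes with a header row of item names (the names and values here are made up):

```python
# Hypothetical header and one-hot rows, mirroring the CSV the question describes.
header = ["Rock_salt", "Flashlights", "Water", "Blankets"]
rows = [
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]

# Keep the label of every column whose flag is 1 -- one basket per row.
baskets = [[item for item, flag in zip(header, row) if flag == 1]
           for row in rows]

print(baskets)  # [['Rock_salt', 'Flashlights', 'Water'], ['Flashlights', 'Blankets']]
```

The resulting lists of labels can then be fed to `FPGrowth.train` (e.g. via `sc.parallelize(baskets)`) instead of the raw one-hot lines.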

Additionally, your data is tab-separated, not space-separated, so even if the input were correct you should split it like this:

 data.map(lambda line: line.strip().split())
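Note that `str.split()` with no argument splits on any run of whitespace (tabs included) and drops empty strings, which is why it handles this file; a quick check on a made-up line mixing tabs and spaces:

```python
# Hypothetical line containing both a tab and a space, plus a trailing newline.
line = "Rock_salt\tFlashlights Water\n"

# No-argument split() breaks on any whitespace, so tabs, spaces, and the
# trailing newline are all handled uniformly.
print(line.split())  # ['Rock_salt', 'Flashlights', 'Water']
```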

Upvotes: 1
