Mark McGown

Reputation: 1115

How to implement FPGrowth algorithm in Python?

I've successfully used the apriori algorithm in Python as follows:

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

df = pd.read_csv('C:\\Users\\marka\\Downloads\\Assig5.csv')
df = apriori(df, min_support=0.79, use_colnames=True)
rules = association_rules(df, metric="lift", min_threshold=1)
rules[ (rules['lift'] >= 1) &
       (rules['confidence'] >= 1) ]
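For context, mlxtend's apriori operates on a one-hot encoded DataFrame: one row per transaction and one boolean column per item. A minimal pure-Python sketch of that shape, using made-up transactions and item names:

```python
# Hypothetical transactions, just to illustrate the one-hot layout
# that mlxtend's apriori expects.
transactions = [
    ["Rock_salt", "Water", "Blankets"],
    ["Water", "Canned_food"],
    ["Rock_salt", "Water", "Canned_food"],
]

# Collect the set of all items to use as column names.
items = sorted({item for basket in transactions for item in basket})

# Build one one-hot row per transaction: True where the item is present.
rows = [{item: (item in basket) for item in items} for basket in transactions]

for row in rows:
    print(row)
```

Wrapping `rows` in `pd.DataFrame(rows)` gives the kind of boolean matrix that `apriori(df, ...)` consumes.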

I'd like to use the FPGrowth algorithm to see if I get the same result, but I believe I'm using it wrong, since I don't get similar output. The Spark documentation (https://spark.apache.org/docs/1.6.0/mllib-frequent-pattern-mining.html) gives this example:

from pyspark.mllib.fpm import FPGrowth
data = sc.textFile("data/mllib/sample_fpgrowth.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

My code, in turn, is:

from pyspark.mllib.fpm import FPGrowth
from pyspark import SparkConf
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
data = sc.textFile("C:\\Users\\marka\\Downloads\\Assig6.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

But instead of a real answer I get the following. What am I doing wrong?

FreqItemset(items=['1\t1\t1\t1\t1\t1\t1\t0\t0\t0\t0\t1\t1\t0\t0\t1\t1\t1\t1\t1\t0\t0'], freq=24)

To make Assig6 I just resaved my original CSV as a TXT file.

I started changing my format and updated my code per user10136092, but I still get the undesired output. Here are my code, its output, and a sample picture of my new input.

from pyspark.mllib.fpm import FPGrowth
from pyspark import SparkConf
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
data = sc.textFile("C:\\Users\\marka\\Downloads\\Assig2.txt")
data.map(lambda line: line.strip().split())
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

Output

FreqItemset(items=['Rock_salt\tFlashlights\tWater\tSnow_shovels\tBlankets\tCanned_food'], freq=34)


Upvotes: 1

Views: 6210

Answers (2)

Litan Ilany

Reputation: 163

I think the file is tab-separated, so you should split it on '\t' instead of ' ':

transactions = data.map(lambda line: line.strip().split('\t'))
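A quick way to see the difference, using a made-up line in the same shape as the output above (items joined by tabs):

```python
# A hypothetical tab-separated line, like the ones producing the odd output.
line = "Rock_salt\tFlashlights\tWater\n"

# Splitting on a space finds no spaces, so the whole line stays one "item" ...
print(line.strip().split(' '))   # ['Rock_salt\tFlashlights\tWater']

# ... while splitting on a tab yields one item per column.
print(line.strip().split('\t'))  # ['Rock_salt', 'Flashlights', 'Water']
```

That single tab-joined "item" is exactly what shows up inside the FreqItemset in the question.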

Upvotes: 0

user10136092

Reputation: 26

Your data is not valid input for Spark's FPGrowth algorithm.

In Spark each basket should be represented as a list of unique labels, for example:

baskets = sc.parallelize([("Rock Salt", "Blankets"), ("Blankets", "Dry Fruits", "Canned Food")])

not a binary matrix as in the other library you use. Please convert your data to the aforementioned format first.
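A minimal pure-Python sketch of that conversion, assuming the binary matrix comes with a header row of item names (the names and values here are made up):

```python
# Hypothetical header and one-hot rows, mirroring the CSV the question describes.
header = ["Rock_salt", "Flashlights", "Water", "Blankets"]
rows = [
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]

# Keep the label of every column whose flag is 1 -- one basket per row.
baskets = [[item for item, flag in zip(header, row) if flag == 1]
           for row in rows]

print(baskets)  # [['Rock_salt', 'Flashlights', 'Water'], ['Flashlights', 'Blankets']]
```

The resulting lists of labels can then be fed to `FPGrowth.train` (e.g. via `sc.parallelize(baskets)`) instead of the raw one-hot lines.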

Additionally, your data is tab-separated, not space-separated, so even if the input were correct you should split it like this:

 data.map(lambda line: line.strip().split())
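Note that `str.split()` with no argument splits on any run of whitespace (tabs included) and drops empty strings, which is why it handles this file; a quick check on a made-up line mixing tabs and spaces:

```python
# Hypothetical line containing both a tab and a space, plus a trailing newline.
line = "Rock_salt\tFlashlights Water\n"

# No-argument split() breaks on any whitespace, so tabs, spaces, and the
# trailing newline are all handled uniformly.
print(line.split())  # ['Rock_salt', 'Flashlights', 'Water']
```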

Upvotes: 1
