Reputation: 1645
I have a dataframe like this
df = pd.DataFrame(data=[980, 169, 104, 74], columns=['Count'], index=['X,Y,Z', 'X,Z', 'X', 'Y,Z'])
       Count
X,Y,Z    980
X,Z      169
X        104
Y,Z       74
I want to be able to extract association rules from this. I've seen that the Apriori algorithm is the standard approach, and I also found that the Orange data mining library is well known in this field.
The problem is that in order to use the AssociationRulesInducer I first need to create a file containing all the transactions. Since my dataset is really huge (20 columns and 5 million rows), it would be too expensive to write all this data to a file and read it back with Orange.
Do you have any idea how I can take advantage of my current dataframe structure to find association rules?
Upvotes: 4
Views: 6122
Reputation: 830
I know it is an old question, but for anyone running into it when trying to use pandas dataframes for association rules and frequent itemsets (e.g. Apriori):
Have a look at this blog entry explaining how to do that using the mlxtend library.
My only recommendation regarding this great blog entry is that if you are dealing with large datasets, you may run into OOM errors for the one-hot encoded dataframes. In that case I recommend using a sparse dtype: df = df.astype(pd.SparseDtype(int, fill_value=0))
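For illustration, here is a minimal sketch of that approach applied to a count-indexed dataframe like the one in the question. It assumes the comma-separated index labels are the transactions and that repeating each one by its Count is an acceptable way to use the frequencies; apriori and association_rules are mlxtend's documented functions, but the thresholds (min_support, min_confidence) are placeholders.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Count-indexed data as in the question
df = pd.DataFrame(data=[980, 169, 104, 74], columns=['Count'],
                  index=['X,Y,Z', 'X,Z', 'X', 'Y,Z'])

# One-hot encode the comma-separated itemsets: one boolean column per item
onehot = df.index.to_series().str.get_dummies(sep=',')

# Repeat each encoded transaction by its Count so supports reflect the frequencies
# (treating Count as a number of identical transactions is an assumption)
onehot = onehot.loc[onehot.index.repeat(df['Count'])].astype(bool)

# For very large one-hot frames, a sparse dtype keeps memory down, as noted above:
# onehot = onehot.astype(pd.SparseDtype(bool, fill_value=False))

frequent = apriori(onehot, min_support=0.05, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])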
Upvotes: 0
Reputation: 7049
The new Orange3-Associate add-on for the Orange data mining suite seems to include widgets and code that mine frequent itemsets (and, from them, association rules) even from sparse arrays or lists of lists, which may work for you.
With 5M rows, it'd be quite awesome if it did. :)
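In case a scripting sketch helps: the following is a rough outline based on the add-on's documented frequent_itemsets and association_rules functions in orangecontrib.associate.fpgrowth; rebuilding the transaction list from the question's counts, and the 0.05 / 0.6 thresholds, are my own assumptions.

from orangecontrib.associate.fpgrowth import frequent_itemsets, association_rules

# Rebuild transactions as lists of items, repeating each itemset by its count
counts = {('X', 'Y', 'Z'): 980, ('X', 'Z'): 169, ('X',): 104, ('Y', 'Z'): 74}
transactions = [list(items) for items, n in counts.items() for _ in range(n)]

# Frequent itemsets as a {frozenset_of_items: support_count} dict
itemsets = dict(frequent_itemsets(transactions, 0.05))  # 0.05 = minimum support

# Association rules as (antecedent, consequent, support, confidence) tuples
for antecedent, consequent, support, confidence in association_rules(itemsets, 0.6):
    print(set(antecedent), '->', set(consequent), support, confidence)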
Upvotes: 2