I'm trying to use PrefixSpan to identify common sequences with a minimum support threshold of 0.2 and (if possible) a sequence length of between 2 and 6 actions. Order matters, so I'd like to differentiate between the sequences ['1', '2', '3'] and ['1', '3', '2']. For some additional context, the data comes from students who are completing a classroom activity, and I'd like to understand how an intervention may affect how they navigate through that environment.
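To make the "order matters" requirement concrete, here is a minimal sketch of how support is usually counted in sequential pattern mining: a pattern is supported by a sequence if its items appear in that order, with gaps allowed. The helper names and the toy data are hypothetical, and this brute-force check only illustrates the definition; it is not how PrefixSpan itself searches.

```python
from typing import List, Sequence


def contains_in_order(sequence: Sequence[str], pattern: Sequence[str]) -> bool:
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` consumes the iterator up to the match, which enforces order.
    return all(item in it for item in pattern)


def relative_support(sequences: List[Sequence[str]], pattern: Sequence[str]) -> float:
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    hits = sum(contains_in_order(seq, pattern) for seq in sequences)
    return hits / len(sequences)


# Hypothetical toy data: one action list per student.
toy = [
    ['1', '2', '3'],
    ['1', '3', '2'],
    ['1', '2', '4', '3'],
    ['2', '1'],
    ['3', '1', '2'],
]

print(relative_support(toy, ['1', '2', '3']))  # 0.4
print(relative_support(toy, ['1', '3', '2']))  # 0.2
```

With this definition the two orderings get different support values, so both would be reported separately at a 0.2 threshold.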
The raw data has been cleaned such that each row simply has a student ID and a list of their actions while they completed the activity.
What I would like the program to output is a CSV that lists each eligible sequence, as well as the number of times it showed up in the dataset. Here is my code so far:
import csv
from prefixspan import PrefixSpan

def PS_run(input_file, output_file, min_support_percent=20):
    # Define the minimum support threshold as a percentage
    try:
        # Read the CSV file and store the sequences in a list
        sequences = []
        with open(input_file, 'r', newline='') as file:
            reader = csv.reader(file)
            for row in reader:
                user_id, activity1 = row[1], row[2]  # Updated column indices
                # Assuming the actions are already in Python list format
                # If not, you can convert them using eval() as shown in previous responses
                sequences.append(activity1)
        # Define the min support threshold based on the total number of sequences
        min_support = int(len(sequences) * (min_support_percent / 100))
        # Create a PrefixSpan object and fit it to the data
        ps = PrefixSpan(sequences)
        frequent_patterns = ps.frequent(min_support)
        # Write the frequent patterns to an output file
        with open(output_file, 'w', newline='') as out_file:
            writer = csv.writer(out_file)
            for pattern, support in frequent_patterns:
                writer.writerow([str(pattern), str(support)])
        # Display a message indicating completion
        print("Done")
    except Exception as e:
        print("Woopsies!", str(e))
Currently, the program takes forever to process or just comes back with an empty csv.
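Two things in the code above may explain these symptoms, though neither is confirmed. First, `csv.reader` yields every cell as a string, so `sequences.append(activity1)` stores text such as `"['1', '2', '3']"` rather than a list, and a sequence miner would then treat each character (brackets, quotes, commas, spaces) as an action. Second, `int(...)` truncates, so the threshold becomes 0 whenever there are fewer than five rows. The sketch below shows one way to address both. It assumes the action column holds Python-literal text, sits at index 2, and follows a header row; `load_sequences`, `min_support_count`, and the sample data are hypothetical names introduced only for illustration.

```python
import ast
import csv
import io
import math


def load_sequences(file_obj, action_col=2, has_header=True):
    """Read one action list per row, parsing the cell text into a real list."""
    reader = csv.reader(file_obj)
    if has_header:
        next(reader, None)  # skip the header row (assumed to exist)
    sequences = []
    for row in reader:
        # csv.reader yields strings; literal_eval turns "['1', '2']" into a list
        # without executing arbitrary code the way eval() would.
        actions = ast.literal_eval(row[action_col])
        sequences.append([str(a) for a in actions])
    return sequences


def min_support_count(n_sequences, percent=20):
    """Convert a percentage into an absolute count, never below 1."""
    return max(1, math.ceil(n_sequences * percent / 100))


# Hypothetical file contents: an index column, a student ID, and an action list.
sample_text = """,student_id,actions
0,s1,"['1', '2', '3']"
1,s2,"['1', '3', '2']"
"""

sequences = load_sequences(io.StringIO(sample_text))
print(sequences)
print(min_support_count(len(sequences)))
```

The parsed lists and the integer count can then be passed to `PrefixSpan(sequences)` and `ps.frequent(...)` as in the original code. As far as I recall from the `prefixspan` package's README (worth verifying), `frequent()` returns `(support, pattern)` pairs in that order, and the `minlen` and `maxlen` attributes on the `PrefixSpan` object restrict pattern length, which would cover the 2-to-6 requirement.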