Using PrefixSpan with Python to find most common sequences

Question

I'm trying to use PrefixSpan to identify common sequences with a minimum support threshold of 0.2 (a pattern must appear in at least 20% of the sequences) and, if possible, to restrict the results to sequences between 2 and 6 actions long. Order matters, so I'd like to differentiate between the sequences ['1', '2', '3'] and ['1', '3', '2']. For some additional context, the data comes from students who are completing a classroom activity, and I'd like to understand how an intervention may affect how they navigate through that environment.

The raw data has been cleaned such that each row simply has a student ID and a list of their actions while they completed the activity.
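For concreteness, here is a minimal sketch of reading rows in that shape. It assumes (the description above does not confirm this) that the actions column holds the text of a Python list, such as "['1', '2', '3']", and that the student ID is the first column. `csv.reader` returns every cell as a string, so the cell is parsed with `ast.literal_eval`. The `parse_actions` name and the sample rows are hypothetical.

```python
import ast
import csv
import io


def parse_actions(cell):
    """Turn the text "['1', '2', '3']" into the list ['1', '2', '3']."""
    actions = ast.literal_eval(cell)
    if not isinstance(actions, list):
        raise ValueError(f"expected a list of actions, got {cell!r}")
    return actions


# Hypothetical sample in the assumed shape: student ID, then the actions.
SAMPLE = '''student_1,"['1', '2', '3']"
student_2,"['1', '3', '2']"
'''

# Each CSV cell is a string until parse_actions() converts it to a list.
sequences = [parse_actions(row[1]) for row in csv.reader(io.StringIO(SAMPLE))]
print(sequences)
```

If the cell were passed along unparsed, a sequence miner would iterate over the string one character at a time, so brackets, quotes, and commas would all count as "actions".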

What I would like the program to output is a CSV that lists each eligible sequence along with the number of times it appears in the dataset. Here is my code so far:

import csv
from prefixspan import PrefixSpan

def PS_run(input_file, output_file, min_support_percent=20):
    # Define the minimum support threshold as a percentage

    try:
      # Read the CSV file and store the sequences in a list
      sequences = []
      with open(input_file, 'r', newline='') as file:
        reader = csv.reader(file)
        for row in reader:
            user_id, activity1 = row[1], row[2]  # Updated column indices
            # Assumes the actions are already in Python list format.
            # If they are stored as text, they would need converting first (e.g. with ast.literal_eval()).
            sequences.append(activity1)

      # Define the minimum support threshold based on the total number of sequences
      min_support = int(len(sequences) * (min_support_percent / 100))

      # Create a PrefixSpan object and fit it to the data
      ps = PrefixSpan(sequences)
      frequent_patterns = ps.frequent(min_support)

      # Write the frequent patterns to an output file
      with open(output_file, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        for pattern, support in frequent_patterns:
            writer.writerow([str(pattern), str(support)])

      # Display a message indicating completion
      print("Done")

    except Exception as e:
      print("Woopsies!", str(e))

Currently, the program either runs for a very long time or finishes and writes an empty CSV.
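A few things in the code above may be contributing to these symptoms. This is offered as a sketch under assumptions, not a confirmed diagnosis. `int(...)` truncates, so the threshold drops to 0 when there are fewer than five sequences at 20%. The `prefixspan` package's README shows `frequent()` returning `(support, pattern)` pairs, which is the reverse of the order unpacked in the loop. The same README shows `minlen` and `maxlen` attributes for limiting pattern length. The helper names `support_threshold` and `mine_patterns` below are hypothetical, and `sequences` is assumed to already be a list of lists of actions.

```python
import math


def support_threshold(n_sequences, min_support_percent=20):
    """Smallest number of sequences a pattern must appear in (never below 1)."""
    return max(1, math.ceil(n_sequences * min_support_percent / 100))


def mine_patterns(sequences, min_support_percent=20, minlen=2, maxlen=6):
    """Return (support, pattern) pairs, most frequent first."""
    # Imported here so support_threshold() stays usable without the package.
    from prefixspan import PrefixSpan

    ps = PrefixSpan(sequences)
    ps.minlen = minlen  # shortest pattern to report
    ps.maxlen = maxlen  # longest pattern to report
    patterns = ps.frequent(support_threshold(len(sequences), min_support_percent))
    # Each item is (support, pattern), so sort on the first element.
    return sorted(patterns, key=lambda pair: pair[0], reverse=True)
```

When writing the results, each row would then be unpacked as `for support, pattern in mine_patterns(sequences):`. Note that PrefixSpan counts a pattern once per sequence that contains it, and the matched actions do not need to be adjacent.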
