How to perform this sort operation in Python

Question

I am creating a module to analyse frequencies of patterns of tokens and delimiters in a given text split up into sentences.

I have a class "SequencePattern" which identifies one element (token or delimiter) in a set of tokenised sentences, where each SequencePattern has a list attribute "occurrences" consisting of tuples ( n_sentence, n_element ) where this particular element actually occurs. Class SequencePattern has a class-level field, seq_patterns (a set), where all the individual SequencePattern instances are stored.

At this stage in the processing I only have single-element SequencePatterns, and have weeded out all such SequencePatterns having < 2 occurrences. But SequencePattern is a subclass of tuple and the idea is now to find the "two element" SequencePatterns.

The next thing I need to do is to go through all the one-element SequencePatterns which remain after weeding, identifying spots where you find two (or more) adjacent occurrences in the same sentence, i.e. where n_sentence is the same and n_element differs by 1.

So I need to do something along these lines:

occurrences_by_text_order = sorted( SequencePattern.seq_patterns.occurrences )

... but of course this doesn't work: I get

AttributeError: 'set' object has no attribute 'occurrences'

Somehow I need to do an iteration of all SequencePatterns in seq_patterns and then, for each, a "nested" iteration of all occurrences for each of these... and I need to submit this mass of delivered tuples ( n_sentence, n_element ) to the sorted function.

I'm not an experienced Pythonista but I have a suspicion this is a job for a generator (?). Can anyone help?

mike rodent · Accepted Answer

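def get_occurrences():
    # Flatten the occurrences of every SequencePattern into one stream of
    # ( n_sentence, n_element ) tuples.
    for seq_patt in SequencePattern.seq_patterns:
        for occurrence in seq_patt.occurrences:
            yield occurrence

# Tuples sort by n_sentence first, then n_element, i.e. in text order.
occurrences_by_text_order = sorted( get_occurrences() )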
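
The same flattening can probably also be done without a hand-written generator, using itertools.chain.from_iterable from the standard library. The sketch below is only an illustration: the SequencePattern class in it is a hypothetical minimal stand-in (its constructor signature and the sample data are assumptions, not the asker's real code), included so that the snippet runs on its own.

```python
from itertools import chain


class SequencePattern(tuple):
    """Hypothetical stand-in for the asker's class (assumed layout)."""

    # Class-level registry of all instances, as described in the question.
    seq_patterns = set()

    def __new__(cls, elements, occurrences):
        self = super().__new__(cls, elements)
        # List of (n_sentence, n_element) tuples for this element.
        self.occurrences = list(occurrences)
        cls.seq_patterns.add(self)
        return self


# Made-up sample data: three single-element patterns.
SequencePattern(("the",), [(0, 0), (1, 3)])
SequencePattern(("cat",), [(0, 1), (1, 4)])
SequencePattern((",",), [(0, 5), (1, 1)])

# chain.from_iterable concatenates the per-pattern lists lazily;
# sorted() then orders the tuples by n_sentence, then n_element.
occurrences_by_text_order = sorted(
    chain.from_iterable(sp.occurrences for sp in SequencePattern.seq_patterns)
)
print(occurrences_by_text_order)
```

Whether this reads better than the explicit generator is largely a matter of taste; both feed the same stream of tuples to sorted().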

The following then prints every pair of adjacent occurrences, i.e. every candidate two-element sequence which may occur more than once. Because the single-element patterns with < 2 occurrences have already been weeded out, there is no possibility of two-element sequences with frequency > 1 occurring anywhere else:

prev_occurrence = None
for occurrence in sorted( occurrence for seq_patt in SequencePattern.seq_patterns for occurrence in seq_patt.occurrences ):
    # Adjacent means: same n_sentence, and n_element exactly one higher.
    if prev_occurrence is not None and ( occurrence[ 0 ] == prev_occurrence[ 0 ] ) and ( occurrence[ 1 ] - prev_occurrence[ 1 ] == 1 ):
        print( '# prev_occurrence %s occurrence: %s' % ( prev_occurrence, occurrence, ))
    prev_occurrence = occurrence
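
For experimenting with the adjacency test on its own, it may help to pull it out into a small function that takes any iterable of ( n_sentence, n_element ) tuples. This is just a sketch: the name adjacent_pairs and the sample data are made up for illustration and are not part of the original module.

```python
def adjacent_pairs(occurrences):
    """Return (previous, current) pairs that sit side by side in one sentence.

    Hypothetical helper: `occurrences` is any iterable of
    (n_sentence, n_element) tuples.
    """
    ordered = sorted(occurrences)
    return [
        (prev, cur)
        # Pair each occurrence with its successor in text order.
        for prev, cur in zip(ordered, ordered[1:])
        # Same sentence, and element index exactly one higher.
        if prev[0] == cur[0] and cur[1] - prev[1] == 1
    ]


# Made-up sample data, deliberately out of order.
sample = [(1, 3), (0, 0), (0, 5), (1, 4), (0, 1), (2, 0)]
pairs = adjacent_pairs(sample)
for prev, cur in pairs:
    print('# prev_occurrence %s occurrence: %s' % (prev, cur))
```

With this sample, the pairs found are (0, 0) with (0, 1) and (1, 3) with (1, 4); (0, 5) and (2, 0) have no neighbour in their sentence.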
