I am creating a module to analyse frequencies of patterns of tokens and delimiters in a given text split up into sentences.
I have a class "SequencePattern" which identifies one element (token or delimiter) in a set of tokenised sentences. Each SequencePattern has a list attribute "occurrences" consisting of tuples ( n_sentence, n_element ) where this particular element actually occurs. Class SequencePattern also has a class-level field, seq_patterns (a set), where all the individual SequencePattern instances are stored.
At this stage in the processing I only have single-element SequencePatterns, and have weeded out all those with fewer than 2 occurrences. But SequencePattern is a subclass of tuple, and the idea is now to find the "two element" SequencePatterns.
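To make the setup concrete, here is a minimal sketch of the class as described above. The constructor signature is my assumption for illustration only; the real class may be built differently.

```python
class SequencePattern(tuple):
    """A pattern of one or more elements (tokens or delimiters)."""

    # Class-level registry of every pattern created so far.
    seq_patterns = set()

    def __new__(cls, elements):
        # Assumed constructor: takes an iterable of elements.
        self = super().__new__(cls, elements)
        # List of (n_sentence, n_element) positions where the pattern occurs.
        self.occurrences = []
        cls.seq_patterns.add(self)
        return self


the = SequencePattern(("the",))
the.occurrences += [(0, 0), (1, 3)]
cat = SequencePattern(("cat",))
cat.occurrences += [(0, 1), (1, 4)]
```

Because instances hash and compare as plain tuples, two patterns with the same elements would collapse into a single entry in the set.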
The next thing I need to do is to go through all the one-element SequencePatterns which remain after weeding, identifying spots where you find two (or more) adjacent occurrences in the same sentence, i.e. where n_sentence is the same and n_element differs by 1.
So I need to do something along these lines:
occurrences_by_text_order = sorted( SequencePattern.seq_patterns.occurrences )
... but of course this doesn't work: I get
AttributeError: 'set' object has no attribute 'occurrences'
Somehow I need to iterate over all the SequencePatterns in seq_patterns and, for each one, do a "nested" iteration over its occurrences... and then I need to pass this whole mass of ( n_sentence, n_element ) tuples to the sorted function.
I'm not an experienced Pythonista but I have a suspicion this is a job for a generator (?). Can anyone help?
Upvotes: 0
Views: 66
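def get_occurrences():
    # Generator: yield every ( n_sentence, n_element ) tuple from every pattern.
    for seq_patt in SequencePattern.seq_patterns:
        for occurrence in seq_patt.occurrences:
            yield occurrence

# Tuples sort by n_sentence first, then n_element, i.e. in text order.
occurrences_by_text_order = sorted( get_occurrences() )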
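As far as I can tell, the same flattening can also be done without a hand-written generator, using `itertools.chain.from_iterable` from the standard library. The sketch below uses a cut-down stand-in for SequencePattern (its constructor arguments are assumptions) just so that it runs on its own.

```python
from itertools import chain


class SequencePattern(tuple):
    """Stand-in for the real class, just enough to run the example."""

    seq_patterns = set()

    def __new__(cls, elements, occurrences):
        self = super().__new__(cls, elements)
        self.occurrences = list(occurrences)
        cls.seq_patterns.add(self)
        return self


SequencePattern(("the",), [(1, 3), (0, 0)])
SequencePattern(("cat",), [(0, 1), (1, 4)])

# chain.from_iterable lazily concatenates each pattern's occurrences list.
occurrences_by_text_order = sorted(
    chain.from_iterable(p.occurrences for p in SequencePattern.seq_patterns)
)
print(occurrences_by_text_order)  # [(0, 0), (0, 1), (1, 3), (1, 4)]
```

The set has no defined iteration order, but that does not matter here because sorted puts the tuples in text order regardless.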
The following then prints every pair of adjacent occurrences, i.e. the candidates for two-element sequences which may occur more than once. A two-element sequence with frequency > 1 cannot occur anywhere else, because both of its elements must themselves have survived the weeding:
prev_occurrence = None
for occurrence in sorted( occurrence for seq_patt in SequencePattern.seq_patterns for occurrence in seq_patt.occurrences ):
    # Adjacent means: same sentence, and element index exactly one higher.
    if prev_occurrence and ( occurrence[ 0 ] == prev_occurrence[ 0 ] ) and ( occurrence[ 1 ] - prev_occurrence[ 1 ] == 1 ):
        print( '# prev_occurrence %s occurrence: %s' % ( prev_occurrence, occurrence, ))
    prev_occurrence = occurrence
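If the pairs need to be collected rather than printed, one possible variation is to pair each sorted occurrence with its successor using zip. This is only a sketch: `adjacent_pairs` is a hypothetical helper name, and it works on a plain list of ( n_sentence, n_element ) tuples so that it runs without the class.

```python
def adjacent_pairs(occurrences):
    """Return (previous, current) pairs that are neighbours in one sentence."""
    ordered = sorted(occurrences)
    return [
        (prev, curr)
        # zip pairs each occurrence with the one that follows it in text order.
        for prev, curr in zip(ordered, ordered[1:])
        if prev[0] == curr[0] and curr[1] - prev[1] == 1
    ]


sample = [(1, 4), (0, 0), (2, 7), (0, 1), (1, 3), (2, 0)]
print(adjacent_pairs(sample))
# [((0, 0), (0, 1)), ((1, 3), (1, 4))]
```

Each returned pair marks a spot where a two-element SequencePattern could start; counting how often the same two elements appear there would be the next step.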
Upvotes: 1