DyingIsFun

Reputation: 1238

Finding Alliterative Word Sequences with Python

I am working in Python 3.6 with NLTK 3.2.

I am trying to write a program that takes raw text as input and outputs every maximal series of consecutive words beginning with the same letter (i.e. alliterative sequences).

When searching for sequences, I want to ignore certain words and punctuation (for instance, 'it', 'that', 'into', ''s', ',', and '.'), but to include them in the output.

For example, inputting

"The door was ajar. So it seems that Sam snuck into Sally's subaru."

should yield

["so", "it", "seems", "that", "sam", "snuck", "into", "sally's", "subaru"]

I am new to programming and the best I could come up with is:

import nltk
from nltk import word_tokenize

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."

tokened_text = word_tokenize(raw)                   #word tokenize the raw text with NLTK's word_tokenize() function
tokened_text = [w.lower() for w in tokened_text]    #make it lowercase

for w in tokened_text:                              #for each word of the text
    letter = w[0]                                   #consider its first letter
    allit_str = []
    allit_str.append(w)                             #add that word to a list
    pos = tokened_text.index(w)                     #let "pos" be the position of the word being considered
    for i in range(1,len(tokened_text)-pos):        #consider the next word
        if tokened_text[pos+i] in {"the","a","an","that","in","on","into","it",".",",","'s"}:   #if it's one of these
            allit_str.append(tokened_text[pos+i])   #add it to the list
            i=+1                                    #and move on to the next word
        elif tokened_text[pos+i][0] == letter:      #or else, if the first letter is the same
            allit_str.append(tokened_text[pos+i])   #add the word to the list
            i=+1                                    #and move on to the next word
        else:                                       #or else, if the letter is different
            break                                   #break the for loop
    if len(allit_str)>=2:                           #if the list has two or more members
        print(allit_str)                            #print it

which outputs

['ajar', '.']
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['snuck', 'into', 'sally', "'s", 'subaru', '.']
['sally', "'s", 'subaru', '.']
['subaru', '.']

This is close to what I want, except that I don't know how to restrict the program to only print the maximum sequences.

So my questions are:

  1. How can I modify this code to only print the maximum sequence ['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']?
  2. Is there an easier way to do this in Python, maybe with regular expression or more elegant code?
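For reference, one way to restrict the output to maximal runs — sketched here as an illustration, not taken from any of the answers below, with a skip set copied from the code above — is to grow each run greedily and then restart the scan *after* the run instead of at every word:

```python
# A sketch (illustration only): grow a run greedily from each position, then
# resume scanning after the run, so only maximal sequences are reported.
SKIP = {"the", "a", "an", "that", "in", "on", "into", "it", ".", ",", "'s"}

def maximal_alliterations(tokens):
    runs = []
    i = 0
    while i < len(tokens):
        letter = tokens[i][0]
        last_match = i  # index of the last word that truly alliterates
        j = i + 1
        while j < len(tokens) and (tokens[j] in SKIP or tokens[j][0] == letter):
            if tokens[j] not in SKIP:
                last_match = j
            j += 1
        run = tokens[i:last_match + 1]  # trailing skipped tokens are trimmed
        if sum(1 for w in run if w not in SKIP) >= 2:
            runs.append(run)
        i = last_match + 1 if last_match > i else i + 1
    return runs
```

On the tokenized example sentence this returns only the single maximal run starting at 'so' (with the trailing '.' trimmed), and it also drops the spurious ['ajar', '.'] pair, since a run must contain at least two non-skipped words.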

Similar questions have been asked elsewhere, but they have not helped me modify my code. (I also think it would be nice to have this question answered on this site.)

Upvotes: 2

Views: 2677

Answers (2)

Uzay Macar

Reputation: 264

The accepted answer is very comprehensive, but I would suggest using Carnegie Mellon's pronouncing dictionary (CMUdict). This is partly because it makes life easier, and partly because words that sound alike without sharing the same initial letters also count as alliteration. An example I found online (https://examples.yourdictionary.com/alliteration-examples.html) is "Finn fell for Phoebe".
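To make the point concrete, here is a toy sketch (the phoneme mapping below is hand-written for this example; the actual script below looks words up in CMUdict):

```python
# Hand-written stand-in for CMUdict lookups (illustration only): each word
# maps to its ARPABET phoneme sequence.
TOY_PHONEMES = {
    "finn":   ["F", "IH1", "N"],
    "fell":   ["F", "EH1", "L"],
    "for":    ["F", "AO1", "R"],
    "phoebe": ["F", "IY1", "B", "IY0"],
}

def initial_phoneme(word):
    return TOY_PHONEMES.get(word, ["?"])[0]

# "finn" and "phoebe" share no initial letter, but both begin with the
# phoneme F, so a sound-based comparison treats the whole phrase as
# alliterative.
assert len({initial_phoneme(w) for w in "finn fell for phoebe".split()}) == 1
```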

import re
import nltk

# nltk.download('cmudict') ## download CMUdict for phoneme set
# The phoneme dictionary consists of ARPABET symbols, which encode
# vowels, consonants, and a representative stress level (wiki/ARPABET)
phoneme_dictionary = nltk.corpus.cmudict.dict()
stress_symbols = ['0', '1', '2', '3...', '-', '!', '+', '/',
                  '#', ':', ':1', '.', ':2', '?', ':3']

# nltk.download('stopwords') ## download stopwords (the, a, of, ...)
# Get stopwords that will be discarded in comparison
stopwords = nltk.corpus.stopwords.words("english")
# Function for removing all punctuation marks (. , ! * etc.)
no_punct = lambda x: re.sub(r'[^\w\s]', '', x)

def get_phonemes(word):
    if word in phoneme_dictionary:
        return phoneme_dictionary[word][0] # return first entry by convention
    else:
        return ["NONE"] # no entries found for input word

def get_alliteration_level(text): # alliteration based on sound, not only letter!
    count, total_words = 0, 0
    proximity = 2 # max phonemes to compare to for consideration of alliteration
    i = 0 # index for placing phonemes into current_phonemes
    lines = text.split(sep="\n")
    for line in lines:
        current_phonemes = [None] * proximity
        for word in line.split(sep=" "):
            word = no_punct(word) # remove punctuation marks for correct identification
            total_words += 1
            if word not in stopwords:
                if (get_phonemes(word)[0] in current_phonemes): # alliteration occurred
                    count += 1
                current_phonemes[i] = get_phonemes(word)[0] # update new comparison phoneme
                i = 0 if i == 1 else 1 # update storage index

    alliteration_score = count / total_words
    return alliteration_score

Above is the proposed script. The proximity variable is introduced so that alliterating syllables separated by one or more intervening words are still considered. The stress_symbols variable reflects the stress levels indicated in the CMU dictionary, and it could easily be incorporated into the function.
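For instance, stress could be ignored by stripping the trailing digit from each ARPABET vowel before comparing (a sketch of one possible approach, not part of the script above):

```python
import re

# ARPABET vowels carry a trailing stress digit (AH0, AH1, AH2, ...);
# stripping it makes phonemes compare equal regardless of stress.
def strip_stress(phoneme):
    return re.sub(r"\d+$", "", phoneme)

assert strip_stress("AH0") == strip_stress("AH2") == "AH"
assert strip_stress("F") == "F"  # consonants pass through unchanged
```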

Upvotes: 0

PrettyHands

Reputation: 598

Interesting task. Personally, I'd loop through without the use of indices, keeping track of the previous word to compare it with the current word.

Additionally, it's not enough to compare letters; you have to take into account that 's' and 'sh', for instance, don't alliterate with each other. Here's my attempt:

import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.corpus import stopwords
import string
from collections import defaultdict, OrderedDict
import operator

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru. She seems shy sometimes. Someone save Simon."

# Get the English alphabet as a list of letters
letters = list(string.ascii_lowercase)

# Here we add some extra phonemes that are distinguishable in text.
# ('sailboat' and 'shark' don't alliterate, for instance)
# Digraphs go first as we need to try matching these before the individual letters,
# and break out if found.
sounds = ["ch", "ph", "sh", "th"] + letters 

# Use NLTK's built in stopwords and add "'s" to them
stopwords = stopwords.words('english') + ["'s"] # add extra stopwords here
stopwords = set(stopwords) # sets are MUCH faster to process

sents = sent_tokenize(raw)

alliterating_sents = defaultdict(list)
for sent in sents:
    tokenized_sent = word_tokenize(sent)

    # Create list of alliterating word sequences
    alliterating_words = []
    previous_initial_sound = ""
    previous_word = ""
    initial_sound = ""  # guards against a leading token that starts with no letter
    for word in tokenized_sent:
        for sound in sounds:
            if word.lower().startswith(sound): # only lowercasing when comparing retains original case
                initial_sound = sound
                if initial_sound == previous_initial_sound:
                    if len(alliterating_words) > 0:
                        if previous_word == alliterating_words[-1]: # prevents duplication in chains of more than 2 alliterations (but assumes repetition is not alliteration)
                            alliterating_words.append(word)
                        else:
                            alliterating_words.append(previous_word)
                            alliterating_words.append(word)
                    else:
                        alliterating_words.append(previous_word)
                        alliterating_words.append(word)                
                break # Allows us to treat sh/s distinctly

        # This needs to be at the end of the loop
        # It sets us up for the next iteration
        if word not in stopwords: # ignores stopwords for the purpose of determining alliteration
            previous_initial_sound = initial_sound
            previous_word = word

    alliterating_sents[len(alliterating_words)].append(sent)

sorted_alliterating_sents = OrderedDict(sorted(alliterating_sents.items(), key=operator.itemgetter(0), reverse=True))

# OUTPUT
print("A sorted ordered dict of sentences by number of alliterations:")
print(sorted_alliterating_sents)
print("-" * 15)
max_key = max(sorted_alliterating_sents) # to get sent(s) with max alliteration
print("Sentence(s) with most alliteration:", sorted_alliterating_sents[max_key])

This produces a sorted ordered dictionary of sentences keyed by their alliteration counts. The max_key variable holds the highest count, and can be used to access the most alliterative sentence or sentences.
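The shape of that result can be seen with toy data (the sentences and counts below are invented for illustration):

```python
from collections import OrderedDict, defaultdict
import operator

# Invented mapping of alliteration count -> sentences with that count.
alliterating_sents = defaultdict(list)
alliterating_sents[0].append("The door was ajar.")
alliterating_sents[2].append("Someone save Simon.")
alliterating_sents[4].append("She seems shy sometimes.")

sorted_alliterating_sents = OrderedDict(
    sorted(alliterating_sents.items(), key=operator.itemgetter(0), reverse=True))

max_key = max(sorted_alliterating_sents)  # highest alliteration count
assert max_key == 4
assert sorted_alliterating_sents[max_key] == ["She seems shy sometimes."]
```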

Upvotes: 2
