Reputation: 1238
I am working in Python 3.6 with NLTK 3.2.
I am trying to write a program which takes raw text as input and outputs any (maximum) series of consecutive words beginning with the same letter (i.e. alliterative sequences).
When searching for sequences, I want to ignore certain words and punctuation (for instance, 'it', 'that', 'into', ''s', ',', and '.'), but to include them in the output.
For example, inputting
"The door was ajar. So it seems that Sam snuck into Sally's subaru."
should yield
["so", "it", "seems", "that", "sam", "snuck", "into", "sally's", "subaru"]
I am new to programming and the best I could come up with is:
import nltk
from nltk import word_tokenize
raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."
tokened_text = word_tokenize(raw) #word tokenize the raw text with NLTK's word_tokenize() function
tokened_text = [w.lower() for w in tokened_text] #make it lowercase
for w in tokened_text: #for each word of the text
letter = w[0] #consider its first letter
allit_str = []
allit_str.append(w) #add that word to a list
pos = tokened_text.index(w) #let "pos" be the position of the word being considered
for i in range(1,len(tokened_text)-pos): #consider the next word
if tokened_text[pos+i] in {"the","a","an","that","in","on","into","it",".",",","'s"}: #if it's one of these
allit_str.append(tokened_text[pos+i]) #add it to the list
i=+1 #and move on to the next word
elif tokened_text[pos+i][0] == letter: #or else, if the first letter is the same
allit_str.append(tokened_text[pos+i]) #add the word to the list
i=+1 #and move on to the next word
else: #or else, if the letter is different
break #break the for loop
if len(allit_str)>=2: #if the list has two or more members
print(allit_str) #print it
which outputs
['ajar', '.']
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['snuck', 'into', 'sally', "'s", 'subaru', '.']
['sally', "'s", 'subaru', '.']
['subaru', '.']
This is close to what I want, except that I don't know how to restrict the program to only print the maximum sequences.
So my questions are:
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
?Here are similar questions asked elsewhere, but which have not helped me modify my code:
(I also think it would be nice to have this question answered on this site.)
Upvotes: 2
Views: 2677
Reputation: 264
The accepted answer is very comprehensive, but I would suggest using Carnegie Mellon's pronouncing dictionary. This is partly because it makes life easier, and partly because identical sounding syllables that are not necessarily identical letter-to-letter are also considered alliterations. An example I found online (https://examples.yourdictionary.com/alliteration-examples.html) is "Finn fell for Phoebe".
# nltk.download('cmudict') ## download CMUdict for phoneme set
# The phoneme dictionary consists of ARPABET which encode
# vowels, consonants, and a representitive stress-level (wiki/ARPABET)
phoneme_dictionary = nltk.corpus.cmudict.dict()
stress_symbols = ['0', '1', '2', '3...', '-', '!', '+', '/',
'#', ':', ':1', '.', ':2', '?', ':3']
# nltk.download('stopwords') ## download stopwords (the, a, of, ...)
# Get stopwords that will be discarded in comparison
stopwords = nltk.corpus.stopwords.words("english")
# Function for removing all punctuation marks (. , ! * etc.)
no_punct = lambda x: re.sub(r'[^\w\s]', '', x)
def get_phonemes(word):
if word in phoneme_dictionary:
return phoneme_dictionary[word][0] # return first entry by convention
else:
return ["NONE"] # no entries found for input word
def get_alliteration_level(text): # alliteration based on sound, not only letter!
count, total_words = 0, 0
proximity = 2 # max phonemes to compare to for consideration of alliteration
i = 0 # index for placing phonemes into current_phonemes
lines = text.split(sep="\n")
for line in lines:
current_phonemes = [None] * proximity
for word in line.split(sep=" "):
word = no_punct(word) # remove punctuation marks for correct identification
total_words += 1
if word not in stopwords:
if (get_phonemes(word)[0] in current_phonemes): # alliteration occurred
count += 1
current_phonemes[i] = get_phonemes(word)[0] # update new comparison phoneme
i = 0 if i == 1 else 1 # update storage index
alliteration_score = count / total_words
return alliteration_score
Above is the proposed script. The variable proximity
is introduced so that we consider syllables in alliteration, that are otherwise separated by multiple words. The stress_symbols
variables reflect stress levels indicated on the CMU dictionary, and it could be easily incorporated in to the function.
Upvotes: 0
Reputation: 598
Interesting task. Personally, I'd loop through without the use of indices, keeping track of the previous word to compare it with the current word.
Additionally, it's not enough to compare letters; you have to take into account that 's' and 'sh' etc don't alliterate. Here's my attempt:
import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.corpus import stopwords
import string
from collections import defaultdict, OrderedDict
import operator
raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru. She seems shy sometimes. Someone save Simon."
# Get the English alphabet as a list of letters
letters = [letter for letter in string.ascii_lowercase]
# Here we add some extra phonemes that are distinguishable in text.
# ('sailboat' and 'shark' don't alliterate, for instance)
# Digraphs go first as we need to try matching these before the individual letters,
# and break out if found.
sounds = ["ch", "ph", "sh", "th"] + letters
# Use NLTK's built in stopwords and add "'s" to them
stopwords = stopwords.words('english') + ["'s"] # add extra stopwords here
stopwords = set(stopwords) # sets are MUCH faster to process
sents = sent_tokenize(raw)
alliterating_sents = defaultdict(list)
for sent in sents:
tokenized_sent = word_tokenize(sent)
# Create list of alliterating word sequences
alliterating_words = []
previous_initial_sound = ""
for word in tokenized_sent:
for sound in sounds:
if word.lower().startswith(sound): # only lowercasing when comparing retains original case
initial_sound = sound
if initial_sound == previous_initial_sound:
if len(alliterating_words) > 0:
if previous_word == alliterating_words[-1]: # prevents duplication in chains of more than 2 alliterations, but assumes repetition is not alliteration)
alliterating_words.append(word)
else:
alliterating_words.append(previous_word)
alliterating_words.append(word)
else:
alliterating_words.append(previous_word)
alliterating_words.append(word)
break # Allows us to treat sh/s distinctly
# This needs to be at the end of the loop
# It sets us up for the next iteration
if word not in stopwords: # ignores stopwords for the purpose of determining alliteration
previous_initial_sound = initial_sound
previous_word = word
alliterating_sents[len(alliterating_words)].append(sent)
sorted_alliterating_sents = OrderedDict(sorted(alliterating_sents.items(), key=operator.itemgetter(0), reverse=True))
# OUTPUT
print ("A sorted ordered dict of sentences by number of alliterations:")
print (sorted_alliterating_sents)
print ("-" * 15)
max_key = max([k for k in sorted_alliterating_sents]) # to get sent with max alliteration
print ("Sentence(s) with most alliteration:", sorted_alliterating_sents[max_key])
This produces a sorted ordered dictionary of sentences with their alliteration counts as its keys. The max_key
variable contains the count for the highest alliterating sentence or sentences, and can be used to access the sentences themselves.
Upvotes: 2