Reputation: 172
I've created this code with the aim of using a large sample of a corpus to establish the extent to which vocabulary size is reduced when both number and case normalisation are applied.
def vocabulary_size(sentences):
    tok_counts = {}
    for sentence in sentences:
        for token in sentence:
            tok_counts[token] = tok_counts.get(token, 0) + 1
    return len(tok_counts.keys())
rcr = ReutersCorpusReader()
sample_size = 10000
raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]
lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here
raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))
Though as it stands, it only prints each individual character. I think I have localised the problem to two lines. A list has no attribute .lower(), so I'm not sure what to replace it with.
I also think I may have to feed lowered_sentences into normalised_sentences.
Here is my normalise function:
def normalise(token):
    print(["NUM" if token.isdigit()
           else "Nth" if re.fullmatch(r"[\d]+(st|nd|rd|th)", token)
           else token for token in token])
Though I may not even be meant to use this specific normalise function. Perhaps I'm attacking this the wrong way; my apologies, I shall be back with more information.
Upvotes: 0
Views: 278
Reputation: 93
I see a few things that should clear this up for you.
lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here
Here you've forgotten to actually use the correct variable; you probably meant
lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]
Also, since a list doesn't have the method lower(), you'd have to apply it to every token in each sentence, i.e.
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
Also, your normalise(token) is not returning anything, just printing. So the list comprehension
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here
produces a list of nothing but None.
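A version of normalise that returns its result instead of printing it might look like this (a sketch based on the question's own function, which iterates over a whole sentence despite its parameter name):
import re

def normalise(sentence):
    # "NUM" for plain digit strings, "Nth" for ordinals like "1st"/"22nd",
    # otherwise keep the token unchanged
    return ["NUM" if token.isdigit()
            else "Nth" if re.fullmatch(r"\d+(st|nd|rd|th)", token)
            else token for token in sentence]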
I'd suggest you refrain from using list comprehensions and start off with normal for loops until you have your algorithm in place; convert them later if speed is needed.
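For instance, the lowering and normalising steps written with plain loops might look like this (a sketch, assuming the returning normalise above):
lowered_sentences = []
for sentence in tokenised_sentences:
    lowered = []
    for token in sentence:
        lowered.append(token.lower())  # lower-case token by token
    lowered_sentences.append(lowered)

normalised_sentences = []
for sentence in lowered_sentences:
    normalised_sentences.append(normalise(sentence))
This keeps every intermediate value easy to inspect with a print or a debugger before compressing the loops back into comprehensions.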
Upvotes: 3
Reputation: 12581
You appear to be using the wrong variable in your comprehensions:
# Wrong
lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences]
# Right (lower() must be applied per token, since each sentence is a list)
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]
However, if you want to normalise your lower-case sentences, we need to change that line too:
# Right
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in lowered_sentences]
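Putting it together, the end-to-end script might then look like this (a sketch, assuming the ReutersCorpusReader and word_tokenize from the question are in scope, and that normalise has been changed to return its list rather than print it):
rcr = ReutersCorpusReader()
sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

# lower-case per token, then normalise each lowered sentence
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in lowered_sentences]

raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size, raw_vocab_size, normalised_vocab_size))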
Upvotes: 2