Reputation: 172
I've created this code with the aim of using a large sample of a corpus to establish the extent to which vocabulary size is reduced when both number and case normalisation are applied.
def vocabulary_size(sentences):
    tok_counts = {}
    for sentence in sentences:
        for token in sentence:
            tok_counts[token] = tok_counts.get(token, 0) + 1
    return len(tok_counts.keys())
rcr = ReutersCorpusReader()
sample_size = 10000
raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]
lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here
raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))
Though as it stands, it only prints each individual character. I think I have localised the problem to two lines. A list has no attribute .lower(), so I'm not sure what to replace it with.
I also think I may have to feed lowered_sentences into normalised_sentences.
Here is my normalise function:
def normalise(token):
    print(["NUM" if token.isdigit()
           else "Nth" if re.fullmatch(r"[\d]+(st|nd|rd|th)", token)
           else token for token in token])
Though I may not even be meant to use this specific normalise function. Perhaps I'm attacking this the wrong way; my apologies, I shall be back with more information.
Upvotes: 0
Views: 278
Reputation: 93
I see a few things that should clear this up for you.
lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here
Here you've forgotten to actually use the correct variable; you probably meant
lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]
Also, since a list doesn't have the method lower(), you'd have to apply it to every token in each sentence, i.e.
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
Also, your normalise(token) is not returning anything, just printing. So the list comprehension
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here
produces a list of nothing but None.
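A version of normalise that returns its result instead of printing it might look like this (a sketch based on the question's own function, which iterates over a whole sentence despite its parameter name):
import re

def normalise(sentence):
    # "NUM" for plain digit strings, "Nth" for ordinals like "1st"/"22nd",
    # otherwise keep the token unchanged
    return ["NUM" if token.isdigit()
            else "Nth" if re.fullmatch(r"\d+(st|nd|rd|th)", token)
            else token for token in sentence]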
I'd suggest you refrain from using list comprehensions and start off with normal for loops until you have your algorithm in place; convert them later if speed is needed.
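For instance, the lowering and normalising steps written with plain loops might look like this (a sketch, assuming the returning normalise above):
lowered_sentences = []
for sentence in tokenised_sentences:
    lowered = []
    for token in sentence:
        lowered.append(token.lower())  # lower-case token by token
    lowered_sentences.append(lowered)

normalised_sentences = []
for sentence in lowered_sentences:
    normalised_sentences.append(normalise(sentence))
This keeps every intermediate value easy to inspect with a print or a debugger before compressing the loops back into comprehensions.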
Upvotes: 3
Reputation: 12581
You appear to be using the wrong variable in your comprehensions:
# Wrong
lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences]
# Right (lower() must be applied per token, since each sentence is a list)
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]
However, if you want to normalise your lower-case sentences, we need to change that line too:
# Right
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in lowered_sentences]
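Putting it together, the end-to-end script might then look like this (a sketch, assuming the ReutersCorpusReader and word_tokenize from the question are in scope, and that normalise has been changed to return its list rather than print it):
rcr = ReutersCorpusReader()
sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

# lower-case per token, then normalise each lowered sentence
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in lowered_sentences]

raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size, raw_vocab_size, normalised_vocab_size))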
Upvotes: 2