snapcrack

Reputation: 1813

NLTK stemmer occasionally including punctuation in stemmed word

Update:

Despite the rigorous cleaning, some words with periods are still being tokenized with the periods intact, including strings where there are spaces between the periods and the quotation marks. I've created a public link with an example of the problem here in a Jupyter Notebook: https://drive.google.com/file/d/0B90qb2J7ZLYrZmItME5RRlhsVWM/view?usp=sharing

Or a shorter example:

word_tokenize('This is a test. "')
['This', 'is', 'a', 'test.', '``']

But the problem disappears when the other type of double quote is used:

word_tokenize('This is a test. ”')
['This', 'is', 'a', 'test', '.', '”']
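For what it's worth, padding spaces around the period itself (the approach suggested in the answers below) seems to make the split happen even with the straight double quote; a quick check, which should give the split tokens:

import re
from nltk import word_tokenize

# Pad spaces around sentence punctuation before tokenizing (exact behavior
# may vary slightly across NLTK versions).
padded = re.sub(r'([.,!?()])', r' \1 ', 'This is a test. "')
word_tokenize(padded)
# expected: ['This', 'is', 'a', 'test', '.', '``']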

Original:

I'm stemming a large corpus of text, and to see the counts of each word I created a counter, which I then transferred to a dataframe for easier handling. Each piece of text is a large string of between 100 and 5,000 words. The dataframe with the word counts looks like this, taking words that only have counts of 11, for instance:

allwordsdf[(allwordsdf['count'] == 11)]


        words          count
551     throughlin     11
1921    rampd          11
1956    pinhol         11
2476    reckhow        11
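For reference, the counter-to-dataframe step is roughly this (a sketch; the actual code isn't shown here, but the column names match allwordsdf):

from collections import Counter
import pandas as pd

# Tally every stem across the corpus, then move the Counter into a dataframe.
wordcounts = Counter(s for stems in articles['stems'] for s in stems)
allwordsdf = pd.DataFrame(list(wordcounts.items()), columns=['words', 'count'])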

What I've noticed is that there are a lot of words that weren't fully stemmed, and they have periods attached to the end. For instance:

4233    activist.   11
9243    storyline.  11

I'm not sure what accounts for this. I know the periods are typically being split off on their own, because the period row stands at:

23  .   5702880

Also, it seems like it's not doing it for every instance of, say, 'activist.':

len(articles[articles['content'].str.contains('activist.')])
9600
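(Note that str.contains treats the pattern as a regular expression by default, so the unescaped dot in 'activist.' matches any character, and the count above also includes 'activists', 'activist,' and so on. Counting the literal period would look like:)

len(articles[articles['content'].str.contains(r'activist\.')])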

Not sure if I'm overlooking something---yesterday I ran into a problem with the NLTK stemmer that was a bug, and I don't know if it's that or something I'm doing (always more likely).

Thanks for any guidance.

Edit:

Here's the function I'm using:

import sys
import time

progress = 0
start = time.time()

def stem(x):
    # Tokenize one document, report progress, and return its stemmed tokens.
    global start
    global progress
    end = time.time()
    tokens = word_tokenize(x)
    progress += 1
    sys.stdout.write('\r {} percent, {} position, {} per second '.format(
        100.0 * progress / len(articles), progress, 1 / (end - start)))
    stems = [stemmer.stem(e) for e in tokens]
    start = time.time()
    return stems


articles['stems'] = articles['content'].apply(stem)

Edit 2:

Here is a JSON to some of the data: all the strings, tokens and stems.

And this is a snippet of what I get when I look for all the words, after tokenizing and stemming, that still have periods:

allwordsdf[allwordsdf['words'].str.contains(r'\.')] # dataframe made from the counter dict

      words       count
23    .           5702875
63    years.      1231
497   was.        281
798   lost.       157
817   jie.        1
819   teacher.    24
858   domains.    1
875   fallout.    3
884   net.        23
889   option.     89
895   step.       67
927   pool.       30
936   that.       4245
954   compute.    2
1001  dr.         11007
1010  decisions.  159

The length of that slice comes out to about 49,000.
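Summarizing that slice in the two numbers tracked below (unique period-containing tokens and their total occurrences) is straightforward; a sketch:

period_words = allwordsdf[allwordsdf['words'].str.contains(r'\.')]
len(period_words)             # unique tokens that still contain a period (~49,000 here)
period_words['count'].sum()   # total occurrences of those tokens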

Edit 3:

Alvas's answer helped cut down the number of words with periods by about half, to 24,000 unique words and a total count of 518,980, which is still a lot. The problem, as I discovered, is that it happens EVERY time a period is followed by a quotation mark. For instance, take the string 'sickened.', which appears once in the tokenized words.

If I search the corpus:

articles[articles['content'].str.contains(r'sickened\.[^\s]')]

The only place in the entire corpus it shows up is here:

...said he was “sickened.” Trump's running mate...

This is not an isolated incident; it's what I've seen over and over while searching for these terms. They have a quotation mark after them every time. So the tokenizer fails to handle not just character-period-quotation-character sequences, but character-period-quotation-whitespace as well.
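A rough way to gauge how widespread this is (a sketch; the character class covers straight and curly double quotes):

# Period immediately followed by a double quotation mark.
pattern = r'\.["“”]'
articles['content'].str.contains(pattern).sum()   # number of documents affected
articles['content'].str.count(pattern).sum()      # total occurrences in the corpus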

Upvotes: 1

Views: 1214

Answers (2)

alvas

Reputation: 122112

The code from the answer above works for clean text:

porter = PorterStemmer()
sents = ['This is a foo bar, sentence.', 'Yet another, foo bar!']
articles = pd.DataFrame(sents, columns=['content'])
articles['tokens'] = articles['content'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])

Looking at the JSON file, you have very dirty data. Most probably, when you scraped the text from the website you didn't put spaces between the <p>...</p> tags or sections that you were extracting, and that leads to chunks of text like:

“So [now] AlphaGo actually learns from its own searches to improve its neural networks, both the policy network and the value network, and this makes it learn in a much more general way. One of the things we’re most excited about is not just that it can play Go better but we hope that this’ll actually lead to technologies that are more generally applicable to other challenging domains.”AlphaGo is comprised of two networks: a policy network that selects the next move to play, and a value network that analyzes the probability of winning. The policy network was initially based on millions of historical moves from actual games played by Go professionals. But AlphaGo Master goes much further by searching through the possible moves that could occur if a particular move is played, increasing its understanding of the potential fallout.“The original system played against itself millions of times, but it didn’t have this component of using the search,” Hassabis tells The Verge. “[AlphaGo Master is] using its own strength to improve its own predictions. So whereas in the previous version it was mostly about generating data, in this version it’s actually using the power of its own search function and its own abilities to improve one part of itself, the policy net.”

Note that there are many instances where you have open quotes directly following a fullstop, e.g. domains.”AlphaGo.

And if you try to use the default NLTK word_tokenize function on this, you will get the tokens domains. (with the period still attached), ”, and AlphaGo; i.e.

>>> from nltk import word_tokenize

>>> text = u"""“So [now] AlphaGo actually learns from its own searches to improve its neural networks, both the policy network and the value network, and this makes it learn in a much more general way. One of the things we’re most excited about is not just that it can play Go better but we hope that this’ll actually lead to technologies that are more generally applicable to other challenging domains.”AlphaGo is comprised of two networks: a policy network that selects the next move to play, and a value network that analyzes the probability of winning. The policy network was initially based on millions of historical moves from actual games played by Go professionals. But AlphaGo Master goes much further by searching through the possible moves that could occur if a particular move is played, increasing its understanding of the potential fallout.“The original system played against itself millions of times, but it didn’t have this component of using the search,” Hassabis tells The Verge. “[AlphaGo Master is] using its own strength to improve its own predictions. So whereas in the previous version it was mostly about generating data, in this version it’s actually using the power of its own search function and its own abilities to improve one part of itself, the policy net.”"""

>>> word_tokenize(text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']

>>> 'domains.' in word_tokenize(text)
True

So there are several ways to resolve this; here are a couple:

  • Try cleaning up your data before feeding them to the word_tokenize function, e.g. padding spaces between punctuations first

  • Try a different tokenizer, e.g. MosesTokenizer

Padding spaces between punctuations first

>>> import re
>>> clean_text = re.sub('([.,!?()])', r' \1 ', text)
>>> word_tokenize(clean_text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains', u'.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout', u'.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in word_tokenize(clean_text)
False

Using MosesTokenizer:

>>> from nltk.tokenize.moses import MosesTokenizer
>>> mo = MosesTokenizer()
>>> mo.tokenize(text)
[u'\u201c', u'So', u'&#91;', u'now', u'&#93;', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains', u'.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout', u'.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'&#91;', u'AlphaGo', u'Master', u'is', u'&#93;', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in mo.tokenize(text)
False
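A third option, if re-scraping is possible, is to fix this at extraction time by joining the text of the block elements with an explicit separator (a sketch assuming BeautifulSoup; adapt to whatever scraper was actually used):

from bs4 import BeautifulSoup

# Hypothetical snippet of the scraped page: two adjacent <p> elements.
html = '<p>...other challenging domains.</p><p>“AlphaGo is comprised of two networks...</p>'
soup = BeautifulSoup(html, 'html.parser')
# get_text(separator=' ') puts a space between elements, so the full stop and
# the opening quote never get glued together in the first place.
text = soup.get_text(separator=' ')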

TL;DR

Use:

from nltk.tokenize.moses import MosesTokenizer
mo = MosesTokenizer()
articles['tokens'] = articles['content'].apply(mo.tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])

Or:

articles['clean'] = articles['content'].apply(lambda x: re.sub('([.,!?()])', r' \1 ', x))
articles['tokens'] = articles['clean'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])

Upvotes: 1

alvas

Reputation: 122112

You need to tokenize the string before stemming:

>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize
>>> text = 'This is a foo bar sentence, that contains punctuations.'
>>> porter = PorterStemmer()
>>> [porter.stem(word) for word in text.split()]
[u'thi', 'is', 'a', 'foo', 'bar', 'sentence,', 'that', u'contain', 'punctuations.']
>>> [porter.stem(word) for word in word_tokenize(text)]
[u'thi', 'is', 'a', 'foo', 'bar', u'sentenc', ',', 'that', u'contain', u'punctuat', '.']

In a dataframe:

porter = PorterStemmer()
articles['tokens'] = articles['content'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])

>>> import pandas as pd
>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize
>>> sents = ['This is a foo bar, sentence.', 'Yet another, foo bar!']
>>> df = pd.DataFrame(sents, columns=['content'])
>>> df
                        content
0  This is a foo bar, sentence.
1         Yet another, foo bar!

# Apply tokenizer.
>>> df['tokens'] = df['content'].apply(word_tokenize)
>>> df
                        content                                   tokens
0  This is a foo bar, sentence.  [This, is, a, foo, bar, ,, sentence, .]
1         Yet another, foo bar!           [Yet, another, ,, foo, bar, !]

# Without DataFrame.apply
>>> df['tokens'][0]
['This', 'is', 'a', 'foo', 'bar', ',', 'sentence', '.']
>>> [porter.stem(word) for word in df['tokens'][0]]
[u'thi', 'is', 'a', 'foo', 'bar', ',', u'sentenc', '.']

# With DataFrame.apply
>>> df['tokens'].apply(lambda row: [porter.stem(word) for word in row])
0    [thi, is, a, foo, bar, ,, sentenc, .]
1             [yet, anoth, ,, foo, bar, !]

# Or if you like nested lambdas.
>>> df['tokens'].apply(lambda x: map(lambda y: porter.stem(y), x))
0    [thi, is, a, foo, bar, ,, sentenc, .]
1             [yet, anoth, ,, foo, bar, !]
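(If you run this on Python 3, note that map() is lazy there, so wrap it in list() to store actual lists rather than map objects:)

>>> df['tokens'].apply(lambda x: list(map(porter.stem, x)))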

Upvotes: 2
