Reputation: 1813
Despite the rigorous cleaning, some words with periods are still being tokenized with the periods intact, including strings that have spaces between the periods and the quotation marks. I've created a public link to an example of the problem in a Jupyter Notebook here: https://drive.google.com/file/d/0B90qb2J7ZLYrZmItME5RRlhsVWM/view?usp=sharing
Or a shorter example:
word_tokenize('This is a test. "')
['This', 'is', 'a', 'test.', '``']
But the problem disappears when the other type of double quote is used:
word_tokenize('This is a test. ”')
['This', 'is', 'a', 'test', '.', '”']
I'm stemming a large corpus of text; I created a counter to see the counts of each word and transferred that counter to a dataframe for easier handling. Each piece of text is a large string of between 100 and 5,000 words. The dataframe with the word counts looks like this, taking words that only have counts of 11, for instance:
allwordsdf[(allwordsdf['count'] == 11)]
words count
551 throughlin 11
1921 rampd 11
1956 pinhol 11
2476 reckhow 11
What I've noticed is that there are a lot of words that weren't fully stemmed, and they have periods attached to the end. For instance:
4233 activist. 11
9243 storyline. 11
I'm not sure what accounts for this. I know the tokenizer is normally splitting periods off as separate tokens, because the period row stands at:
23 . 5702880
Also, it seems like it's not doing it for every instance of, say, 'activist.':
len(articles[articles['content'].str.contains('activist.')])
9600
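(An aside on that search, in case the numbers look off: str.contains treats the pattern as a regular expression by default, so the trailing period matches any character, e.g. 'activists' or 'activist,'. Escaping it, as a rough check, counts only the literal occurrences:)
len(articles[articles['content'].str.contains(r'activist\.')])  # r'\.' matches a literal period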
Not sure if I'm overlooking something. Yesterday I ran into a problem with the NLTK stemmer that turned out to be a bug, and I don't know if this is that or something I'm doing (always more likely).
Thanks for any guidance.
Here's the function I'm using:
progress = 0
start = time.time()

def stem(x):
    global start
    global progress
    end = time.time()
    tokens = word_tokenize(x)
    progress += 1
    # crude progress/rate indicator, overwritten in place on one line
    sys.stdout.write('\r {} percent, {} position, {} per second '.format(
        str(float(progress / len(articles))), str(progress), (1 / (end - start))))
    stems = [stemmer.stem(e) for e in tokens]
    start = time.time()
    return stems

articles['stems'] = articles.content.apply(lambda x: stem(x))
Here is a JSON file with some of the data: all the strings, tokens and stems.
And this is a snippet of what I get when I look for all the words, after tokenizing and stemming, that still have periods:
allwordsdf[allwordsdf['words'].str.contains(r'\.')] # dataframe made from the counter dict
words count
23 . 5702875
63 years. 1231
497 was. 281
798 lost. 157
817 jie. 1
819 teacher. 24
858 domains. 1
875 fallout. 3
884 net. 23
889 option. 89
895 step. 67
927 pool. 30
936 that. 4245
954 compute. 2
1001 dr. 11007
1010 decisions. 159
The length of that slice comes out to about 49,000.
Alvas's answer helped cut the number of words with periods down by about half, to 24,000 unique words and a total count of 518,980, which is still a lot. The problem, as I discovered, is that it happens EVERY time there's a period followed by a quotation mark. For instance, take the token 'sickened.', which appears once in the tokenized words.
If I search the corpus:
articles[articles['content'].str.contains(r'sickened\.[^\s]')]
The only place in the entire corpus it shows up is here:
...said he was “sickened.” Trump's running mate...
This is not an isolated incident; it's what I've seen over and over while searching for these terms. They have a quotation mark after them every time. So the tokenizer isn't just failing on character-period-quotation-character, but also on character-period-quotation-whitespace.
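For reference, here is a minimal cleanup sketch in the spirit of the space-padding approach in the answer below; the exact regex is just an illustration, padding periods and curly quotes with spaces before tokenizing:
import re
from nltk import word_tokenize

text = u"said he was “sickened.” Trump's running mate"
# pad periods and curly quotes with spaces so the tokenizer sees them separately
clean = re.sub(u"([.“”])", r" \1 ", text)
word_tokenize(clean)  # 'sickened' and '.' now come out as separate tokens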
Upvotes: 1
Views: 1214
Reputation: 122112
The code from the other answer works for clean text:
import pandas as pd
from nltk import word_tokenize
from nltk.stem import PorterStemmer

porter = PorterStemmer()
sents = ['This is a foo bar, sentence.', 'Yet another, foo bar!']
articles = pd.DataFrame(sents, columns=['content'])
articles['tokens'] = articles['content'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])
Looking at the JSON file, you have very dirty data. Most probably, when you scraped the text from the website you didn't put spaces between the <p>...</p> tags or sections you were extracting, so it leads to chunks of text like:
“So [now] AlphaGo actually learns from its own searches to improve its neural networks, both the policy network and the value network, and this makes it learn in a much more general way. One of the things we’re most excited about is not just that it can play Go better but we hope that this’ll actually lead to technologies that are more generally applicable to other challenging domains.”AlphaGo is comprised of two networks: a policy network that selects the next move to play, and a value network that analyzes the probability of winning. The policy network was initially based on millions of historical moves from actual games played by Go professionals. But AlphaGo Master goes much further by searching through the possible moves that could occur if a particular move is played, increasing its understanding of the potential fallout.“The original system played against itself millions of times, but it didn’t have this component of using the search,” Hassabis tells The Verge. “[AlphaGo Master is] using its own strength to improve its own predictions. So whereas in the previous version it was mostly about generating data, in this version it’s actually using the power of its own search function and its own abilities to improve one part of itself, the policy net.”
Note that there are many instances where you have a quotation mark directly following a full stop, e.g. domains.”AlphaGo. And if you try to use the default NLTK word_tokenize function on this, you will get domains., ”, AlphaGo as separate tokens; i.e.
>>> from nltk import word_tokenize
>>> text = u"""“So [now] AlphaGo actually learns from its own searches to improve its neural networks, both the policy network and the value network, and this makes it learn in a much more general way. One of the things we’re most excited about is not just that it can play Go better but we hope that this’ll actually lead to technologies that are more generally applicable to other challenging domains.”AlphaGo is comprised of two networks: a policy network that selects the next move to play, and a value network that analyzes the probability of winning. The policy network was initially based on millions of historical moves from actual games played by Go professionals. But AlphaGo Master goes much further by searching through the possible moves that could occur if a particular move is played, increasing its understanding of the potential fallout.“The original system played against itself millions of times, but it didn’t have this component of using the search,” Hassabis tells The Verge. “[AlphaGo Master is] using its own strength to improve its own predictions. So whereas in the previous version it was mostly about generating data, in this version it’s actually using the power of its own search function and its own abilities to improve one part of itself, the policy net.”"""
>>> word_tokenize(text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in word_tokenize(text)
True
So there are several ways to resolve this; here are a couple:
Try cleaning up your data before feeding it to the word_tokenize function, e.g. by padding punctuation with spaces first
Try a different tokenizer, e.g. MosesTokenizer
Padding spaces around punctuation first:
>>> import re
>>> clean_text = re.sub('([.,!?()])', r' \1 ', text)
>>> word_tokenize(clean_text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains', u'.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout', u'.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in word_tokenize(clean_text)
False
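One caveat on the blanket padding (just a sketch of the trade-off, not a required change): it also splits decimal points, e.g. 3.5 becomes 3 . 5, and abbreviations like dr. get split too. If the decimals matter, a digit-aware lookaround is one option:
>>> re.sub('([.,!?()])', r' \1 ', 'a 3.5 percent rise in fallout.')
'a 3 . 5 percent rise in fallout . '
>>> re.sub(r'(?<!\d)([.,!?()])(?!\d)', r' \1 ', 'a 3.5 percent rise in fallout.')
'a 3.5 percent rise in fallout . '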
Using MosesTokenizer:
>>> from nltk.tokenize.moses import MosesTokenizer
>>> mo = MosesTokenizer()
>>> mo.tokenize(text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains', u'.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout', u'.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in mo.tokenize(text)
False
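(Side note, as an assumption about your install: in later NLTK releases the Moses tokenizer was split out into the separate sacremoses package, so if nltk.tokenize.moses isn't available, roughly the same thing is:)
>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer()
>>> 'domains.' in mt.tokenize(text, escape=False)  # escape=False keeps quotes unescaped
False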
Use:
from nltk.tokenize.moses import MosesTokenizer
mo = MosesTokenizer()
articles['tokens'] = articles['content'].apply(mo.tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])
Or:
articles['clean'] = articles['content'].apply(lambda x: re.sub('([.,!?()])', r' \1 ', x))
articles['tokens'] = articles['clean'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])
Upvotes: 1
Reputation: 122112
You need to tokenize the string before stemming:
>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize
>>> text = 'This is a foo bar sentence, that contains punctuations.'
>>> porter = PorterStemmer()
>>> [porter.stem(word) for word in text.split()]
[u'thi', 'is', 'a', 'foo', 'bar', 'sentence,', 'that', u'contain', 'punctuations.']
>>> [porter.stem(word) for word in word_tokenize(text)]
[u'thi', 'is', 'a', 'foo', 'bar', u'sentenc', ',', 'that', u'contain', u'punctuat', '.']
In a dataframe:
porter = PorterStemmer()
articles['tokens'] = articles['content'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])
>>> import pandas as pd
>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize
>>> sents = ['This is a foo bar, sentence.', 'Yet another, foo bar!']
>>> df = pd.DataFrame(sents, columns=['content'])
>>> df
content
0 This is a foo bar, sentence.
1 Yet another, foo bar!
# Apply tokenizer.
>>> df['tokens'] = df['content'].apply(word_tokenize)
>>> df
content tokens
0 This is a foo bar, sentence. [This, is, a, foo, bar, ,, sentence, .]
1 Yet another, foo bar! [Yet, another, ,, foo, bar, !]
# Without DataFrame.apply
>>> df['tokens'][0]
['This', 'is', 'a', 'foo', 'bar', ',', 'sentence', '.']
>>> [porter.stem(word) for word in df['tokens'][0]]
[u'thi', 'is', 'a', 'foo', 'bar', ',', u'sentenc', '.']
# With DataFrame.apply
>>> df['tokens'].apply(lambda row: [porter.stem(word) for word in row])
0 [thi, is, a, foo, bar, ,, sentenc, .]
1 [yet, anoth, ,, foo, bar, !]
# Or if you like nested lambdas.
>>> df['tokens'].apply(lambda x: map(lambda y: porter.stem(y), x))
0 [thi, is, a, foo, bar, ,, sentenc, .]
1 [yet, anoth, ,, foo, bar, !]
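(A small caveat on the nested-lambda version: under Python 3, map is lazy and you'd see map objects in the column, so wrap it in list if you go that route:)
>>> df['tokens'].apply(lambda x: list(map(porter.stem, x)))
0 [thi, is, a, foo, bar, ,, sentenc, .]
1 [yet, anoth, ,, foo, bar, !]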
Upvotes: 2