Reputation: 69
I am new to text mining and NLP. I am working on a small project where I am trying to extract information from a few documents. I am basically doing POS tagging and then using a chunker to find patterns based on the tagged words. Do I need to remove stopwords before doing the POS tagging? Will removing stopwords affect my POS tagger's accuracy?
Upvotes: 3
Views: 3289
Reputation: 7
I would advise you to run the POS tagger before removing stop words. POS tagging is performed as sequence classification, so changing the sequence by removing stop words is very likely to change the POS tags assigned to the remaining words.
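To see this effect concretely, here is a small self-contained sketch (the tiny hand-tagged sentences are made up for illustration) where removing the stopword "I" changes the tag a context-sensitive tagger assigns to "can":

```python
from nltk import UnigramTagger, BigramTagger

# Toy hand-tagged corpus: "can" is a modal (MD) after a pronoun,
# but a noun (NN) after a determiner.
train = [
    [("I", "PRP"), ("can", "MD"), ("fish", "VB")],
    [("the", "DT"), ("can", "NN"), ("was", "VBD"), ("empty", "JJ")],
    [("a", "DT"), ("can", "NN"), ("fell", "VBD")],
]
tagger = BigramTagger(train, backoff=UnigramTagger(train))

print(tagger.tag(["I", "can", "fish"]))  # "can" tagged MD, thanks to the PRP context
print(tagger.tag(["can", "fish"]))       # stopword removed: "can" falls back to NN
```

With the pronoun present, the bigram context identifies "can" as a modal; without it, the context is unseen and the backoff unigram tagger returns the majority tag NN.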
Upvotes: 0
Reputation: 122240
Let's use this as an example to train/test a tagger:
First get the corpus and stoplist
>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('cess_esp')
Load the wrappers in NLTK
>>> from nltk.corpus import cess_esp as cess
>>> from nltk.corpus import stopwords
# Import the tagger classes.
>>> from nltk import UnigramTagger, BigramTagger
# Load the Spanish stopwords
>>> stoplist = stopwords.words('spanish')
# Load the Spanish tagged sentences
>>> cess_sents = cess.tagged_sents()
Split the corpus into train/test sets
>>> len(cess_sents)
6030
>>> test_set = cess_sents[-int(6030/10):]
>>> train_set = cess_sents[:-int(6030/10)]
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> range(10)[-2:]
[8, 9]
>>> range(10)[:-2]
[0, 1, 2, 3, 4, 5, 6, 7]
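The hard-coded 6030 above works but is brittle if the corpus changes; the same 90/10 split can be written generically (hypothetical helper name, plain Python):

```python
def split_corpus(sents, test_fraction=0.1):
    # Hold out the last test_fraction of sentences for evaluation.
    cut = int(len(sents) * test_fraction)
    return sents[:-cut], sents[-cut:]

train, test = split_corpus(list(range(100)))
print(len(train), len(test))  # 90 10
```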
Create an alternate train_set without stopwords.
>>> train_set_nostop = [[(word,tag) for word, tag in sent if word.lower() not in stoplist] for sent in train_set]
See the difference:
>>> train_set[0]
[(u'El', u'da0ms0'), (u'grupo', u'ncms000'), (u'estatal', u'aq0cs0'), (u'Electricit\xe9_de_France', u'np00000'), (u'-Fpa-', u'Fpa'), (u'EDF', u'np00000'), (u'-Fpt-', u'Fpt'), (u'anunci\xf3', u'vmis3s0'), (u'hoy', u'rg'), (u',', u'Fc'), (u'jueves', u'W'), (u',', u'Fc'), (u'la', u'da0fs0'), (u'compra', u'ncfs000'), (u'del', u'spcms'), (u'51_por_ciento', u'Zp'), (u'de', u'sps00'), (u'la', u'da0fs0'), (u'empresa', u'ncfs000'), (u'mexicana', u'aq0fs0'), (u'Electricidad_\xc1guila_de_Altamira', u'np00000'), (u'-Fpa-', u'Fpa'), (u'EAA', u'np00000'), (u'-Fpt-', u'Fpt'), (u',', u'Fc'), (u'creada', u'aq0fsp'), (u'por', u'sps00'), (u'el', u'da0ms0'), (u'japon\xe9s', u'aq0ms0'), (u'Mitsubishi_Corporation', u'np00000'), (u'para', u'sps00'), (u'poner_en_marcha', u'vmn0000'), (u'una', u'di0fs0'), (u'central', u'ncfs000'), (u'de', u'sps00'), (u'gas', u'ncms000'), (u'de', u'sps00'), (u'495', u'Z'), (u'megavatios', u'ncmp000'), (u'.', u'Fp')]
>>> train_set_nostop[0]
[(u'grupo', u'ncms000'), (u'estatal', u'aq0cs0'), (u'Electricit\xe9_de_France', u'np00000'), (u'-Fpa-', u'Fpa'), (u'EDF', u'np00000'), (u'-Fpt-', u'Fpt'), (u'anunci\xf3', u'vmis3s0'), (u'hoy', u'rg'), (u',', u'Fc'), (u'jueves', u'W'), (u',', u'Fc'), (u'compra', u'ncfs000'), (u'51_por_ciento', u'Zp'), (u'empresa', u'ncfs000'), (u'mexicana', u'aq0fs0'), (u'Electricidad_\xc1guila_de_Altamira', u'np00000'), (u'-Fpa-', u'Fpa'), (u'EAA', u'np00000'), (u'-Fpt-', u'Fpt'), (u',', u'Fc'), (u'creada', u'aq0fsp'), (u'japon\xe9s', u'aq0ms0'), (u'Mitsubishi_Corporation', u'np00000'), (u'poner_en_marcha', u'vmn0000'), (u'central', u'ncfs000'), (u'gas', u'ncms000'), (u'495', u'Z'), (u'megavatios', u'ncmp000'), (u'.', u'Fp')]
Train a tagger:
>>> uni_tag = UnigramTagger(train_set)
Train a tagger with corpus without stopwords:
>>> uni_tag_nostop = UnigramTagger(train_set_nostop)
Split the test_set into words and tags:
>>> test_words, test_tags = zip(*[zip(*sent) for sent in test_set])
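The nested zip(*...) trick can be hard to parse; on a toy input (not from the corpus) it does the following:

```python
sents = [[("the", "DT"), ("cat", "NN")],
         [("dogs", "NNS"), ("bark", "VBP")]]

# zip(*sent) splits one tagged sentence into (words, tags);
# the outer zip(*...) then collects all the word-tuples and all the tag-tuples.
words, tags = zip(*[zip(*sent) for sent in sents])
print(words)  # (('the', 'cat'), ('dogs', 'bark'))
print(tags)   # (('DT', 'NN'), ('NNS', 'VBP'))
```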
Tag the test sentences:
>>> tagged_sents = uni_tag.tag_sents(test_words)
>>> tagged_sents_nostop = uni_tag_nostop.tag_sents(test_words)
Evaluate the accuracy (let's just count correctly tagged tokens for now):
>>> sum([ sum(1 for (word,pred_tag), (word, gold_tag) in zip(pred,gold) if pred_tag==gold_tag) for pred, gold in zip(tagged_sents, test_set)])
11266
>>> sum([ sum(1 for (word,pred_tag), (word, gold_tag) in zip(pred,gold) if pred_tag==gold_tag) for pred, gold in zip(tagged_sents_nostop, test_set)])
5963
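The raw counts above are hard to compare without the token totals; a small helper (hypothetical, plain Python) makes the accuracy comparison explicit:

```python
def tag_accuracy(pred_sents, gold_sents):
    # Fraction of tokens whose predicted tag matches the gold tag.
    correct = total = 0
    for pred, gold in zip(pred_sents, gold_sents):
        for (_, p_tag), (_, g_tag) in zip(pred, gold):
            correct += p_tag == g_tag
            total += 1
    return correct / total

pred = [[("the", "DT"), ("cat", "NN")]]
gold = [[("the", "DT"), ("cat", "VB")]]
print(tag_accuracy(pred, gold))  # 0.5
```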
Note that several things make this comparison unfair when you remove the stopwords before training the tagger; not exhaustively:
- your training set will naturally be smaller, since the number of words per sentence shrinks after removing the stopwords
- the tagger will not learn tags for the stopwords and will therefore return None for all stopwords, reducing your tagger's accuracy, since the test set does include stopwords
- when training a higher-order ngram tagger, the sequences without stopwords might not make any sense at all. Not that grammaticality or sensibility accounts for accuracy (esp. in today's NLP), but e.g. "the cat is on the table" -> "cat table" without stopwords.
But as @alexia pointed out, for bag-of-words based vector space models (a.k.a. distributional models, a.k.a. "you can know a word by its neighbors" models, the non-neural precursors of embedding models), removing the stopwords might bring you some mileage in terms of accuracy. As for TF-IDF, the (statistically) magical thing is that stopwords automatically get a low TF-IDF score: because they appear in most documents, they have little discriminatory power to make each document different, so they are not that important (it's the IDF part that's doing the magic).
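That IDF point can be seen in a few lines of plain Python (toy documents, minimal tf-idf with no smoothing):

```python
import math

docs = ["the cat sat on the mat".split(),
        "the dog chased the cat".split(),
        "a bird flew over the house".split()]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in docs)  # document frequency
    idf = math.log(len(docs) / df)     # 0 when the term is in every document
    return tf * idf

print(tf_idf("the", docs[0], docs))  # 0.0 -- stopword, appears in every doc
print(tf_idf("cat", docs[0], docs))  # > 0 -- discriminative content word
```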
Upvotes: 5
Reputation: 50220
What @lenz said. Do not remove the stopwords before you tag -- or before you chunk, for that matter, unless you are training a chunker and you decide to train it (and then use it) on cleaned text. But I wouldn't recommend that either. Stopword removal is appropriate for bag-of-words processes like TF-IDF, but common words like determiners and prepositions provide essential clues about sentence structure, and hence part of speech. Do not remove them if you want to detect sentence units.
But why take my word for it? You can easily check this for yourself, by taking a bit of tagged data and evaluating your tagger and chunker with and without stopword removal. I recommend you do this anyway for the rest of your pipeline.
Upvotes: 1