Reputation: 21
I used NLTK for part-of-speech tagging. It uses the 36 Penn Treebank tags. I want to reduce the number of tags to six: noun, verb, adjective, adverb, preposition, conjunction. How should I do so? Is there a specific function, attribute, or command for this?
Upvotes: 2
Views: 1941
Reputation: 354
You cannot reduce to these 6 tags, because there will be an "other" category for things like determiners or pronouns that cannot be directly reduced to any of the categories you mention.
That said, the short answer is:
The long answer:
To reduce the tags to your "target tags", you can use the Ontologies of Linguistic Annotation [disclosure: I'm maintaining these] with the following SPARQL query:
PREFIX system: <http://purl.org/olia/system.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX olia: <http://purl.org/olia/olia.owl#>
# columns of the mapping table
SELECT distinct ?tag ?category
# lookup in the Ontologies of Linguistic Annotation
FROM <http://purl.org/olia/penn.owl> # Penn tags
FROM <http://purl.org/olia/olia.owl> # reference concepts (Noun etc.)
FROM <http://purl.org/olia/penn-link.rdf> # Penn -> reference concepts
# the actual query
WHERE {
  # for an element with a particular tag
  ?a system:hasTag ?tag.
  # retrieve all its super classes
  OPTIONAL {
    ?a a/(rdfs:subClassOf|owl:equivalentClass|
          owl:unionOf|owl:intersectionOf)* ?b.
    # but only if they match your target categories
    # see http://purl.org/olia/olia.owl for their definitions
    FILTER(?b in (
      olia:Noun, olia:Verb, olia:Adjective,
      olia:Adverb, olia:Preposition,
      olia:Conjunction
    ))
  }
  # return the local name of the target category
  # if none of your target categories can be found, return "OTHER"
  BIND(if(bound(?b), replace(str(?b),".*[#/]",""), "OTHER") AS ?category)
}
ORDER BY ?tag
See the inline comments for an explanation. You can adjust the filter conditions to get more, fewer or other categories. Note that this query can return multiple mappings if Penn tags are ambiguous (disjunction, i.e. owl:unionOf).
There is no need to set up your own endpoint for such occasional queries; just go to http://sparql.org/sparql.html and copy, paste (and edit) that query. Different output formats are possible; select "Output XML" and the default XSL stylesheet to get an HTML view.
The entire query can be condensed into a single URI (as above). You can customize your query and output formats, click on "Get Results" and copy the URL of the resulting page. (Or build it yourself, using standard URI escaping.)
Note that whenever you click on that link, you run a live query. Better do that once and store your mapping table.
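For the "build it yourself" route, a minimal Python sketch of the URI escaping might look like the following. The endpoint address http://sparql.org/sparql is an assumption (only the form page http://sparql.org/sparql.html is named above), and the query is shortened for illustration, so compare the result with the URL the form itself produces.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the form at http://sparql.org/sparql.html submits to this address.
ENDPOINT = "http://sparql.org/sparql"

# Shortened stand-in for the full mapping query (illustration only).
QUERY = """PREFIX system: <http://purl.org/olia/system.owl#>
SELECT DISTINCT ?tag
FROM <http://purl.org/olia/penn.owl>
WHERE { ?a system:hasTag ?tag. }
ORDER BY ?tag"""


def build_query_url(endpoint, query):
    """Percent-encode the query and append it as the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

Characters such as "#", "<", spaces and newlines are escaped by urlencode, so the whole query survives as a single URL parameter.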
Note that the complex expression (rdfs:subClassOf|owl:equivalentClass|owl:unionOf|owl:intersectionOf)* allows you to search over OWL axioms. However, this is search, not reasoning, so you will only retrieve classes that are explicitly defined as superclasses.
Note that owl:unionOf is a logical or. There is no way to disambiguate that by means of a SPARQL query. If you want to treat tags with ambiguous definitions as OTHER, remove that expression from the property path.
Also note that this is not restricted to Penn: OLiA supports tagsets for more than 100 languages, see http://purl.org/olia
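Once the mapping table has been saved, applying it is a dictionary lookup. The sketch below assumes the result was exported as a two-column CSV (tag, category). The sample rows, including the ambiguous IN entry, are made up for illustration and are not actual query output, and the helper names are hypothetical. Since ambiguous tags can yield several rows, this version keeps a tag only if it maps to exactly one category and falls back to OTHER otherwise.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows standing in for a saved result table.
SAVED_TABLE = """tag,category
NN,Noun
VBZ,Verb
JJ,Adjective
IN,Preposition
IN,Conjunction
"""


def load_mapping(csv_text):
    """Return {tag: category}, keeping only tags with exactly one category."""
    categories = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        categories[row["tag"]].add(row["category"])
    return {
        tag: next(iter(cats))
        for tag, cats in categories.items()
        if len(cats) == 1
    }


def reduce_tags(tagged_words, mapping):
    """Replace each tag by its category; unknown or ambiguous tags become OTHER."""
    return [(word, mapping.get(tag, "OTHER")) for word, tag in tagged_words]


mapping = load_mapping(SAVED_TABLE)
reduced = reduce_tags(
    [("is", "VBZ"), ("simple", "JJ"), ("in", "IN"), ("a", "DT")], mapping
)
print(reduced)
```

Dropping ambiguous tags is only one policy; you could equally pick a preferred category per tag by hand.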
Upvotes: 0
Reputation: 2758
I recommend using the tagset_mapping function. If you ask it to map from en-ptb (the Penn Treebank PoS tags) to universal, you will reduce the number of PoS tags.
This is a very simple example to see how to incorporate the method:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.tag.mapping import tagset_mapping
PTB_UNIVERSAL_MAP = tagset_mapping('en-ptb', 'universal')
def to_universal(tagged_words):
    # Replace each Penn Treebank tag with its universal tag
    return [(word, PTB_UNIVERSAL_MAP[tag]) for word, tag in tagged_words]
text = "This is a very simple example."
# pos_tag already returns a list of (word, tag) tuples
pos_tagged = pos_tag(word_tokenize(text))
You can observe the difference before and after the mapping:
print(pos_tagged)
>>>[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('very', 'RB'), ('simple', 'JJ'), ('example', 'NN'), ('.', '.')]
print(to_universal(pos_tagged))
>>> [('This', 'DET'), ('is', 'VERB'), ('a', 'DET'), ('very', 'ADV'), ('simple', 'ADJ'), ('example', 'NOUN'), ('.', '.')]
I would advise you to stick to this mapping, even though there are more resulting tags than desired. This way you'll follow a sort of "convention". Besides, the "extra" tags only cover determiners, pronouns, numerals, particles, punctuation and a catch-all category.
If you strictly want your fixed set "noun, verb, adjective, adverb, preposition, conjunction", you can convert individual tags with the map_tag function and then rename or merge the remaining universal tags yourself.
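One way to get down to the fixed six categories is to rename the universal tags you care about and merge everything else into a catch-all. The sketch below is plain Python over already-mapped (word, tag) pairs, reusing the output shown above as input. The dictionary, the "other" label and the function name are illustrative choices, not part of NLTK.

```python
# Universal tags that correspond to the six target categories.
# ADP (adposition) covers prepositions; CONJ is the conjunction tag.
SIX_CATEGORIES = {
    "NOUN": "noun",
    "VERB": "verb",
    "ADJ": "adjective",
    "ADV": "adverb",
    "ADP": "preposition",
    "CONJ": "conjunction",
}


def to_six_categories(universal_tagged, fallback="other"):
    """Rename the six target tags and merge all remaining tags into `fallback`."""
    return [(word, SIX_CATEGORIES.get(tag, fallback)) for word, tag in universal_tagged]


universal_tagged = [
    ("This", "DET"), ("is", "VERB"), ("a", "DET"), ("very", "ADV"),
    ("simple", "ADJ"), ("example", "NOUN"), (".", "."),
]
six = to_six_categories(universal_tagged)
print(six)
```

Determiners and punctuation end up as "other" here, which is the residual category any six-way reduction has to live with.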
Note that you might have to download extra resources:
import nltk
nltk.download('universal_tagset')
Upvotes: 1
Reputation: 121992
The UPenn tagset documentation can be accessed as such:
>>> import nltk
>>> nltk.help.upenn_tagset()
The question "What are all possible pos tags of NLTK?" has a good, detailed discussion of it.
Note that while the Wall Street Journal (wsj) subset of the Penn Treebank (PTB) uses the UPenn tagset, the brown corpus (a subset of the PTB) has a finer-grained tagset:
>>> nltk.help.brown_tagset()
Although the original PTB has the upenn and brown tags, the tags in the treebank corpus can be mapped. As @alexis has shown, the PTB corpus can be accessed with Universal Tagset tags like this:
from nltk.corpus import treebank
treebank.tagged_sents(tagset="universal")
They are mapped to the Universal Tagset by nltk.tag.mapping.tagset_mapping, using the mapping resources in the nltk_data/taggers/universal_tagset/en-*.map files:
~/nltk_data/taggers/universal_tagset$ ls
README de-negra.map en-tweet.map fi-tdt.map ja-verbmobil.map sl-sdt.map
ar-padt.map de-tiger.map es-cast3lb.map fr-paris.map ko-sejong.map sv-talbanken.map
bg-btb.map el-gdt.map es-eagles.map hu-szeged.map nl-alpino.map tu-metusbanci.map
ca-cat3lb.map en-brown.map es-iula.map it-isst.map pl-ipipan.map universal_tags.py
cs-pdt.map en-ptb.map es-treetagger.map iw-mila.map pt-bosque.map zh-ctb6.map
da-ddt.map en-tweet.README eu-eus3lb.map ja-kyoto.map ru-rnc.map zh-sinica.map
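Each of these .map files is, as far as I recall, a plain tab-separated table with one fine-grained tag and its universal tag per line. A minimal sketch of reading such a table follows. The sample lines are written inline for illustration (they match the DT, VBZ, RB, JJ and NN mappings shown elsewhere on this page) rather than read from nltk_data, and parse_map is a hypothetical helper, not an NLTK function.

```python
# Illustrative excerpt in the assumed "fine<TAB>coarse" layout.
SAMPLE_MAP = "DT\tDET\nVBZ\tVERB\nRB\tADV\nJJ\tADJ\nNN\tNOUN\n"


def parse_map(text):
    """Parse tab-separated 'fine<TAB>coarse' lines into a dict."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fine, coarse = line.split("\t")
        mapping[fine] = coarse
    return mapping


ptb_to_universal = parse_map(SAMPLE_MAP)
print(ptb_to_universal["VBZ"])
```

To read a real file, pass its contents (for example from open(path).read()) to the same helper.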
Upvotes: 1
Reputation: 50190
Ask for the "universal" tagset:
from nltk.corpus import treebank
treebank.tagged_sents(tagset="universal")
It's not quite the list you specify (e.g., it didn't forget about determiners), but it comes close. If you still don't like it, you can rename the rest of the POS tags yourself.
Upvotes: 1