MBit

Reputation: 213

Python: TaggedCorpusReader — how to get from the STTS tagset to the Universal tagset

I'm working on a POS tagger using Python and Keras. The data I have uses STTS tags, but I'm supposed to create a tagger for the universal tagset, so I need to translate between the two.

First I thought of building a dictionary and simply search-replacing the tags, but then I saw the option of setting a tagset via the TaggedCorpusReader (e.g. 'brown').

But I can't find a list of the tagsets that can be used there. Can I use the STTS tagset somehow, or do I have to build a mapping dictionary myself?

Example Source: Code #3 : map corpus tags to the universal tagset https://www.geeksforgeeks.org/nlp-customization-using-tagged-corpus-reader/

corpus = TaggedCorpusReader(filePath, "standard_pos_tagged.txt", tagset='STTS')  # ?? doesn't work sadly
# ....
corpus.tagged_sents(tagset='universal')[1]

In the end it looked something like this: (big thanks to alexis)

with open(resultFileName, "w") as output:
    for sent in stts_corpus.tagged_sents():
        for word, tag in sent:
            try:
                newTag = mapping_dict[tag]
                output.write(word + "/" + newTag + " ")
            except KeyError:
                print("unmapped tag: " + str(word) + " - " + str(tag))
        output.write("\n")
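The mapping dictionary itself is just an ordinary Python dict keyed by STTS tags. The entries below are an illustrative subset, not the complete mapping — the tag pairs are assumptions based on the usual STTS-to-universal correspondence (the full STTS tagset has roughly 54 tags, each of which needs an entry):

```python
# Illustrative subset of an STTS -> universal mapping dictionary.
# These pairs are assumed for demonstration; build out the full table
# before converting a real corpus.
mapping_dict = {
    "NN": "NOUN",     # common noun
    "NE": "NOUN",     # proper noun
    "ART": "DET",     # article
    "ADJA": "ADJ",    # attributive adjective
    "ADV": "ADV",     # adverb
    "APPR": "ADP",    # preposition
    "VVFIN": "VERB",  # finite full verb
    "KON": "CONJ",    # coordinating conjunction
    "CARD": "NUM",    # cardinal number
    "$.": ".",        # sentence-final punctuation
}

print(mapping_dict["NN"])  # NOUN
```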

Upvotes: 0

Views: 282

Answers (1)

alexis

Reputation: 50200

Just create a dictionary and replace the tags, as you considered doing. The nltk's universal tagset support is provided by the module nltk/tag/mapping.py. It relies on a set of mapping files, which you will find in NLTK_DATA/taggers/universal_tagset. For example, in en-brown.map you'll find lines like this, which map a whole bunch of tags to PRT, ABX to DET, and so on:

ABL     PRT
ABN     PRT
ABN-HL  PRT
ABN-NC  PRT
ABN-TL  PRT
ABX     DET
AP      ADJ

These files are read into a dictionary that is used for the translation. By creating a mapping file in the same format, you could use nltk's functions to perform the translation, but honestly, if your task is simply to produce a corpus in the Universal format, I would just do the translation by hand. But not through "search-replace": work with the tuples provided by the nltk's corpus readers, and replace the POS tags by direct lookup in your mapping dictionary.
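A minimal loader for that file format might look like this (the sample lines are taken from the en-brown.map excerpt above; the two columns are assumed to be whitespace-separated, as in the shipped files):

```python
# Parse a universal_tagset-style .map file into a dict.
# Each non-empty line holds a source tag and a universal tag,
# separated by whitespace (tabs in the shipped files).
def load_tag_map(lines):
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        source_tag, universal_tag = line.split()
        mapping[source_tag] = universal_tag
    return mapping

sample = """ABL\tPRT
ABN\tPRT
ABX\tDET
AP\tADJ
"""

mapping = load_tag_map(sample.splitlines())
print(mapping["ABX"])  # DET
```

For a real run you would pass the lines of your own mapping file, e.g. `load_tag_map(open("stts-universal.map"))` (the filename is hypothetical).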

Let's assume you know how to persuade an nltk TaggedCorpusReader to read your corpus, and you now have an stts_corpus reader object with methods tagged_words(), tagged_sents(), etc. You also need the mapping dictionary, whose keys are STTS tags and whose values are universal tags; if ABL were an STTS tag, mapping_dict["ABL"] should return the value PRT. Your remapping then goes something like this:

for filename in stts_corpus.fileids():
    with open("new_dir/" + filename, "w") as output:
        for word, tag in stts_corpus.tagged_words(filename):
            output.write(word + "/" + mapping_dict[tag] + " ")
        output.write("\n")

And that's really all there is to it, unless you want to add luxuries like breaking the text into lines.
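If you do want one sentence per line, iterate over tagged_sents() instead of tagged_words(). A self-contained sketch of that variant — the corpus here is a hand-made stand-in for the reader's tagged_sents() output, and the mapping entries are an assumed partial STTS table:

```python
import io

# Stand-in for stts_corpus.tagged_sents(): a list of sentences,
# each a list of (word, STTS-tag) tuples.
tagged_sents = [
    [("Das", "ART"), ("Haus", "NN"), (".", "$.")],
    [("Es", "PPER"), ("regnet", "VVFIN"), (".", "$.")],
]

# Assumed partial STTS -> universal mapping for the demo.
mapping_dict = {"ART": "DET", "NN": "NOUN", "PPER": "PRON",
                "VVFIN": "VERB", "$.": "."}

output = io.StringIO()  # swap in a real file handle to write to disk
for sent in tagged_sents:
    output.write(" ".join(word + "/" + mapping_dict[tag]
                          for word, tag in sent))
    output.write("\n")

print(output.getvalue())
# Das/DET Haus/NOUN ./.
# Es/PRON regnet/VERB ./.
```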

Upvotes: 1
