Thagor

Reputation: 900

Storing a POS tagged corpus

I use NLTK to POS-tag the German Wikipedia. The structure is quite simple: one big list containing every sentence as a list of (word, POS tag) tuples, for example:

[[(Word1,POS),(Word2,POS),...],[(Word1,POS),(Word2,POS),...],...]

Because Wikipedia is big, I obviously cannot keep the whole list in memory, so I need a way to save parts of it to disk. What would be a good way to do this so that I can easily iterate over all the sentences and words from disk later?

Upvotes: 0

Views: 824

Answers (2)

alexis

Reputation: 50200

The proper thing to do is to save a tagged corpus in the format that NLTK's TaggedCorpusReader expects: use a slash (/) to join each word to its tag, separate the tokens with spaces, and write one sentence per line. That is, each line ends up looking like Word1/POS Word2/POS Word3/POS ....

For some reason, NLTK doesn't provide a function that writes a whole corpus this way. There is a function that combines one word and its tag (nltk.tag.tuple2str), but it is hardly worth the trouble since it's easy enough to do the whole thing directly:

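# tagged_sentences: an iterable of sentences, each a list of (word, tag) tuples.
# The output filename is only an example.
with open("tagged_corpus.txt", "w", encoding="utf-8") as outfile:
    for tagged_sent in tagged_sentences:
        text = " ".join(w + "/" + t for w, t in tagged_sent)
        outfile.write(text + "\n")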

That's it. Later you can use TaggedCorpusReader to read your corpus and iterate over it in the usual ways the NLTK provides (by tagged or untagged word, by tagged or untagged sentence).
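Since the question is about not holding everything in memory, here is a minimal sketch of the same word/TAG format written in chunks and read back lazily with only the standard library (Python 3). The helper names write_tagged and iter_tagged_sents, the file name, and the sample German tags are illustrative assumptions, not part of NLTK. It also assumes tokens contain no spaces.

```python
import os
import tempfile


def write_tagged(sentences, path):
    """Append sentences to path, one per line, tokens written as word/TAG."""
    with open(path, "a", encoding="utf-8") as outfile:
        for sent in sentences:
            outfile.write(" ".join(w + "/" + t for w, t in sent) + "\n")


def iter_tagged_sents(path):
    """Lazily yield each sentence as a list of (word, tag) tuples."""
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            line = line.rstrip("\n")
            if line:
                # rpartition splits on the LAST slash, so a word that itself
                # contains a slash (e.g. "1/2") is kept intact.
                yield [tuple(tok.rpartition("/")[::2]) for tok in line.split(" ")]


# Hypothetical usage: write two chunks, then stream them back.
path = os.path.join(tempfile.mkdtemp(), "tagged.txt")
write_tagged([[("Das", "ART"), ("Haus", "NN")]], path)
write_tagged([[("1/2", "CARD"), ("Liter", "NN")]], path)
sents = list(iter_tagged_sents(path))
```

Because the writer opens the file in append mode, each batch of tagged sentences can be flushed to disk as soon as it is produced, and the reader never holds more than one line at a time.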

Upvotes: 1

alvas

Reputation: 122092

Use pickle, see https://wiki.python.org/moin/UsingPickle:

import io
import cPickle as pickle

from nltk import pos_tag
from nltk.corpus import brown

print brown.sents()
print 

# Let's tag the first 10 sentences.
tagged_corpus = [pos_tag(i) for i in brown.sents()[:10]]

with io.open('brown.pos', 'wb') as fout:
    pickle.dump(tagged_corpus, fout)

with io.open('brown.pos', 'rb') as fin:
    loaded_corpus = pickle.load(fin)

for sent in loaded_corpus:
    print sent
    break

[out]:

[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]

[(u'The', 'DT'), (u'Fulton', 'NNP'), (u'County', 'NNP'), (u'Grand', 'NNP'), (u'Jury', 'NNP'), (u'said', 'VBD'), (u'Friday', 'NNP'), (u'an', 'DT'), (u'investigation', 'NN'), (u'of', 'IN'), (u"Atlanta's", 'JJ'), (u'recent', 'JJ'), (u'primary', 'JJ'), (u'election', 'NN'), (u'produced', 'VBN'), (u'``', '``'), (u'no', 'DT'), (u'evidence', 'NN'), (u"''", "''"), (u'that', 'WDT'), (u'any', 'DT'), (u'irregularities', 'NNS'), (u'took', 'VBD'), (u'place', 'NN'), (u'.', '.')]
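The snippet above pickles the whole list at once, which still requires it to fit in memory. One possible variation, sketched here in Python 3 rather than the Python 2 used above, is to pickle each sentence separately into the same file and read the sentences back one at a time. The helper names dump_sentences and load_sentences and the file name are made up for illustration.

```python
import os
import pickle
import tempfile


def dump_sentences(sentences, path):
    """Append each tagged sentence to path as its own pickle record."""
    with open(path, "ab") as fout:
        for sent in sentences:
            pickle.dump(sent, fout, protocol=pickle.HIGHEST_PROTOCOL)


def load_sentences(path):
    """Yield the sentences back one at a time until the file is exhausted."""
    with open(path, "rb") as fin:
        while True:
            try:
                yield pickle.load(fin)
            except EOFError:
                # No more pickle records in the file.
                return


# Hypothetical usage: dump two batches, then iterate over them from disk.
path = os.path.join(tempfile.mkdtemp(), "corpus.pkl")
dump_sentences([[("The", "DT"), ("jury", "NN")]], path)
dump_sentences([[("said", "VBD"), (".", ".")]], path)
loaded = list(load_sentences(path))
```

Only one sentence is unpickled at a time, so memory use stays flat regardless of corpus size. As with any pickle file, it should only be loaded from a trusted source.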

Upvotes: 1
