Reputation: 900
I use NLTK to POS-tag the German Wikipedia. The structure is quite simple: one big list containing every sentence as a list of (word, POS tag) tuples, for example:
[[(Word1,POS),(Word2,POS),...],[(Word1,POS),(Word2,POS),...],...]
Because Wikipedia is big, I obviously cannot keep the whole list in memory, so I need a way to save parts of it to disk. What would be a good way to do this so that I can easily iterate over all the sentences and words from disk later?
Upvotes: 0
Views: 824
Reputation: 50200
The proper thing to do is to save a tagged corpus in the format that the nltk's TaggedCorpusReader expects: use a slash (/) to combine word and tag, and write each token separately, with one sentence per line. I.e., you'll end up with Word1/POS word2/POS word3/POS ...
For some reason the nltk doesn't provide a function that does that. There's a function to combine one word and its tag, which is not even worth the trouble to look up since it's easy enough to do the whole thing directly:
# outfile is a file object that is already open for writing
for tagged_sent in tagged_sentences:
    text = " ".join(w+"/"+t for w,t in tagged_sent)
    outfile.write(text+"\n")
That's it. Later you can use TaggedCorpusReader to read your corpus and iterate over it in the usual ways the NLTK provides (by tagged or untagged word, by tagged or untagged sentence).
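Since the whole point is to avoid holding everything in memory, here is a minimal plain-Python sketch of the same word/TAG format: it appends one batch of sentences at a time and reads them back lazily, one line per sentence. The names write_tagged and iter_tagged, the sample tags and the file name are made up for illustration; this is a stand-in for the idea, not how TaggedCorpusReader is implemented.

```python
import io
import os
import tempfile

def write_tagged(sentences, path):
    """Append tagged sentences to path, one sentence per line, as word/TAG tokens."""
    with io.open(path, "a", encoding="utf-8") as outfile:
        for tagged_sent in sentences:
            line = u" ".join(u"%s/%s" % (w, t) for w, t in tagged_sent)
            outfile.write(line + u"\n")

def iter_tagged(path):
    """Yield one sentence at a time as a list of (word, tag) tuples."""
    with io.open(path, encoding="utf-8") as infile:
        for line in infile:
            tokens = line.split()
            if tokens:
                # Split on the last slash so that words containing "/" survive.
                yield [tuple(tok.rsplit("/", 1)) for tok in tokens]

# Hypothetical sample data standing in for one batch of tagged sentences.
batch = [[("Das", "ART"), ("ist", "VAFIN"), ("gut", "ADJD")],
         [("1/2", "CARD"), ("Liter", "NN")]]

path = os.path.join(tempfile.mkdtemp(), "tagged.txt")
write_tagged(batch, path)  # call this again for every further batch
restored = list(iter_tagged(path))
```

Because the file is opened in append mode, you can tag a chunk of articles, write it out, drop it from memory and go on to the next chunk. The sketch assumes every token contains at least one slash and that words contain no whitespace.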
Upvotes: 1
Reputation: 122092
Use pickle; see https://wiki.python.org/moin/UsingPickle:
import io
import cPickle as pickle

from nltk import pos_tag
from nltk.corpus import brown

print brown.sents()
print

# Let's tag the first 10 sentences.
tagged_corpus = [pos_tag(i) for i in brown.sents()[:10]]

# Save the tagged sentences to disk.
with io.open('brown.pos', 'wb') as fout:
    pickle.dump(tagged_corpus, fout)

# Load them back and print the first sentence.
with io.open('brown.pos', 'rb') as fin:
    loaded_corpus = pickle.load(fin)

for sent in loaded_corpus:
    print sent
    break
[out]:
[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]
[(u'The', 'DT'), (u'Fulton', 'NNP'), (u'County', 'NNP'), (u'Grand', 'NNP'), (u'Jury', 'NNP'), (u'said', 'VBD'), (u'Friday', 'NNP'), (u'an', 'DT'), (u'investigation', 'NN'), (u'of', 'IN'), (u"Atlanta's", 'JJ'), (u'recent', 'JJ'), (u'primary', 'JJ'), (u'election', 'NN'), (u'produced', 'VBN'), (u'``', '``'), (u'no', 'DT'), (u'evidence', 'NN'), (u"''", "''"), (u'that', 'WDT'), (u'any', 'DT'), (u'irregularities', 'NNS'), (u'took', 'VBD'), (u'place', 'NN'), (u'.', '.')]
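A single pickle.dump of the whole list still needs the full corpus in memory. One way around that, sketched below under my own assumptions, is to append one pickle per batch to the same file and call pickle.load repeatedly until it raises EOFError. The helper names dump_batch and iter_sentences, the sample data and the file name are hypothetical.

```python
import os
import pickle
import tempfile

def dump_batch(batch, path):
    """Append one batch of tagged sentences to the pickle file."""
    with open(path, "ab") as fout:
        pickle.dump(batch, fout, protocol=pickle.HIGHEST_PROTOCOL)

def iter_sentences(path):
    """Yield sentences batch by batch; only one batch is in memory at a time."""
    with open(path, "rb") as fin:
        while True:
            try:
                batch = pickle.load(fin)
            except EOFError:
                break  # no more pickled batches in the file
            for sent in batch:
                yield sent

# Hypothetical sample batches standing in for chunks of the tagged corpus.
path = os.path.join(tempfile.mkdtemp(), "corpus.pos")
dump_batch([[("The", "DT"), ("jury", "NN")]], path)
dump_batch([[("It", "PRP"), ("said", "VBD")], [("Yes", "UH")]], path)

restored = list(iter_sentences(path))
```

The batch size controls the trade-off: larger batches mean fewer load calls but more memory per step. As with any pickle file, only load files you created yourself.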
Upvotes: 1