Prepare data for scikit-learn

Question

I am working on a small NLP project of authorship attribution: I have some texts from two authors and I want to say who wrote them.

I have some pre-processed text (tokenized, pos-tagged, ect.) and I want to load it into sciki-learn.

The documents have this shape:

Testo   -   SPN Testo   testare+v+indic+pres+nil+1+sing testo+n+m+sing  O
:   -   XPS colon   colon+punc  O
"   -   XPO "   quotation_mark+punc O
Buongiorno  -   I   buongiorno  buongiorno+inter buongiorno+n+m+_   O
a   -   E   a   a+prep  O
tutti   -   PP  tutto   tutto+adj+m+plur+pst+ind tutto+pron+_+m+_+plur+ind  O
.      XPS full_stop   full_stop+punc  O
Ci  -   PP  pro loc+pron+loc+_+3+_+clit pro+pron+accdat+_+1+plur+clit   O
sarebbe -   VI  essere  essere+v+cond+pres+nil+2+sing   O
molto   -   B   molto   molto+adj+m+sing+pst+ind

So it's a tab separeted text file of 6 columns (word, end of sentence marker, part of speech, lemma, morphological information and named entity recognition marker).

Every file represents a document to classify.

What would be the best way to shape them for scikit learn?

Prepare data for scikit-learn

Answers (1)

Related Questions