Reputation:
I am working on a small NLP project on authorship attribution: I have texts from two authors and I want to determine who wrote each of them.
I have some pre-processed text (tokenized, POS-tagged, etc.) and I want to load it into scikit-learn.
The documents look like this:
Testo - SPN Testo testare+v+indic+pres+nil+1+sing testo+n+m+sing O
: - XPS colon colon+punc O
" - XPO " quotation_mark+punc O
Buongiorno - I buongiorno buongiorno+inter buongiorno+n+m+_ O
a - E a a+prep O
tutti - PP tutto tutto+adj+m+plur+pst+ind tutto+pron+_+m+_+plur+ind O
. <eos> XPS full_stop full_stop+punc O
Ci - PP pro loc+pron+loc+_+3+_+clit pro+pron+accdat+_+1+plur+clit O
sarebbe - VI essere essere+v+cond+pres+nil+2+sing O
molto - B molto molto+adj+m+sing+pst+ind
So it's a tab-separated text file with six columns (word, end-of-sentence marker, part of speech, lemma, morphological information, and named-entity recognition marker).
Every file represents a document to classify.
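For reference, here is a minimal sketch of how I currently read one file (the helper name and the idea of keeping each token as a plain list of fields are my own):

def read_document(path):
    # One token per line; the six fields are tab-separated:
    # word, eos marker, pos, lemma, morphology, ner marker
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n').split('\t') for line in f if line.strip()]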
What would be the best way to shape them for scikit-learn?
Upvotes: 1
Views: 768
Reputation: 832
The structure used in the scikit-learn text tutorial (https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#) is described in the load_files documentation: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html
Replace this
# Load some categories from the training set
if opts.all_categories:
    categories = None
else:
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
    ]

if opts.filtered:
    remove = ('headers', 'footers', 'quotes')
else:
    remove = ()

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
with your data load statements, for example:
from sklearn.datasets import load_files

# Load the two categories from the training set
categories = [
    'high',
    'low',
]

print("Loading dataset for categories:")
print(categories if categories else "all")

train_path = 'c:/Users/username/Documents/SciKit/train'
data_train = load_files(train_path, categories=categories, encoding='latin1')

test_path = 'c:/Users/username/Documents/SciKit/test'
data_test = load_files(test_path, categories=categories, encoding='latin1')
Then, inside each of the train and test directories, create "high" and "low" subdirectories and put every document file into the subdirectory of its category.
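load_files expects exactly that layout, one subdirectory per class (the file names below are placeholders):

train/
    high/
        doc001.txt
        doc002.txt
    low/
        doc003.txt
test/
    high/
        doc004.txt
    low/
        doc005.txt

After loading, data_train.data holds the raw file contents and data_train.target the class indices, so they can go straight into a vectorizer. A minimal sketch (the classifier here is just an illustration; any scikit-learn estimator works):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

clf = MultinomialNB()
clf.fit(X_train, data_train.target)
print(clf.score(X_test, data_test.target))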
Upvotes: 1