Reputation: 3354
I have a .conll output file from MaltParser, produced with the pre-trained engmalt.linear-1.7.mco model. My original input was a large text file of sentences. How can I use this file for feature selection?
I am using Python with scikit-learn (currently selecting features with a TF-IDF bag of words). However, I want to make use of NLP information, for example by keeping only adjectives. How can I do this with a CoNLL file?
Upvotes: 0
Views: 2123
Reputation: 8366
The output of a parser in the CoNLL-X format provides a separate column for the part-of-speech tags. For example, if you parse the sentence
"I want to select adjectives only, and disregard other tags."
the output might be as follows:
1 I _ PRP PRP _ 2 nsubj _ _
2 want _ VB VBP _ 0 null _ _
3 to _ TO TO _ 4 aux _ _
4 select _ VB VB _ 2 xcomp _ _
5 adjectives _ NN NNS _ 4 dobj _ _
6 only _ RB RB _ 4 advmod _ _
7 , _ , , _ 2 punct _ _
8 and _ CC CC _ 2 cc _ _
9 disregard _ VB VB _ 2 conj _ _
10 other _ JJ JJ _ 11 amod _ _
11 tags _ NN NNS _ 9 dobj _ _
12 . _ . . _ 2 punct _ _
Columns 4 and 5 hold the coarse- and fine-grained part-of-speech tags, respectively. To select only adjectives, pick the words that have JJ as their coarse-grained tag in column 4.
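As a rough illustration of that column-4 filter, here is a minimal sketch in plain Python. The function name `select_by_coarse_tag` and the `SAMPLE` string are made up for this example. It assumes fields are separated by whitespace (real CoNLL-X files use tabs, which `str.split()` also handles) and that sentences are separated by blank lines.

```python
# Sample parser output from above, embedded as a string for the demo.
SAMPLE = """\
1 I _ PRP PRP _ 2 nsubj _ _
2 want _ VB VBP _ 0 null _ _
3 to _ TO TO _ 4 aux _ _
4 select _ VB VB _ 2 xcomp _ _
5 adjectives _ NN NNS _ 4 dobj _ _
6 only _ RB RB _ 4 advmod _ _
7 , _ , , _ 2 punct _ _
8 and _ CC CC _ 2 cc _ _
9 disregard _ VB VB _ 2 conj _ _
10 other _ JJ JJ _ 11 amod _ _
11 tags _ NN NNS _ 9 dobj _ _
12 . _ . . _ 2 punct _ _
"""


def select_by_coarse_tag(conll_text, wanted_tags=("JJ",)):
    """Return one list of matching word forms per sentence.

    Hypothetical helper: keeps the word (column 2) of every row whose
    coarse-grained tag (column 4) is in wanted_tags.
    """
    sentences = []
    current = None  # None means we are not inside a sentence yet
    for line in conll_text.splitlines():
        fields = line.split()
        if not fields:
            # A blank line closes the current sentence, if any.
            if current is not None:
                sentences.append(current)
                current = None
            continue
        if current is None:
            current = []
        # fields[1] is the word form, fields[3] is the coarse tag.
        if len(fields) >= 4 and fields[3] in wanted_tags:
            current.append(fields[1])
    if current is not None:
        sentences.append(current)
    return sentences


print(select_by_coarse_tag(SAMPLE))  # [['other']]
```

A sentence with no adjectives yields an empty list rather than being dropped, so the output stays aligned with the input sentences. Passing `wanted_tags=("NN",)` would pick the nouns instead.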
Once you have selected the words according to your selection criteria, you can construct the feature vectors in the usual way.
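Since you mention scikit-learn, one possible way to build the vectors is sketched below. It assumes each document has already been reduced to a list of selected words; the `docs` data and the `identity` helper are invented for illustration. Passing a callable as `analyzer` tells `TfidfVectorizer` to use the token lists as they are, without its own tokenization or lowercasing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: one list of selected adjectives per document.
docs = [
    ["other"],
    ["large", "original"],
    ["large"],
]


def identity(tokens):
    """Return the token list unchanged (the documents are pre-tokenized)."""
    return tokens


# A callable analyzer bypasses the built-in preprocessing and tokenization.
vectorizer = TfidfVectorizer(analyzer=identity)
X = vectorizer.fit_transform(docs)

print(X.shape)                         # (3, 3)
print(sorted(vectorizer.vocabulary_))  # ['large', 'original', 'other']
```

The resulting sparse matrix `X` has one row per document and one column per distinct selected word, and can be passed to a classifier like any other TF-IDF matrix.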
P.S. I assumed your question was mostly about the CoNLL format rather than about how to extract the adjectives (which can be done by splitting rows on tabs or by regex matching; there are several questions and answers on SO about Pythonic ways of doing that).
Upvotes: 3