jeffrey

Reputation: 3354

How can I use my .conll file from an NLP parser for feature selection?

I have a .conll file produced by MaltParser using the engmalt.linear-1.7.mco model. My original input was a large text file of sentences. How can I use this file for feature selection?

I am using Python with scikit-learn (currently using a TF-IDF bag of words to select features). However, I want to make use of NLP, for example by considering only adjectives. How can I do this with a CoNLL file?

Upvotes: 0

Views: 2123

Answers (1)

Chthonic Project

Reputation: 8366

The output of a parser in the CoNLL-X format provides a separate column for the part-of-speech tags. For example, if you parse the sentence

"I want to select adjectives only, and disregard other tags."

the output might be as follows:

1   I           _   PRP PRP _   2   nsubj   _   _
2   want        _   VB  VBP _   0   null    _   _
3   to          _   TO  TO  _   4   aux     _   _
4   select      _   VB  VB  _   2   xcomp   _   _
5   adjectives  _   NN  NNS _   4   dobj    _   _
6   only        _   RB  RB  _   4   advmod  _   _
7   ,           _   ,   ,   _   2   punct   _   _
8   and         _   CC  CC  _   2   cc      _   _
9   disregard   _   VB  VB  _   2   conj    _   _
10  other       _   JJ  JJ  _   11  amod    _   _
11  tags        _   NN  NNS _   9   dobj    _   _
12  .           _   .   .   _   2   punct   _   _

Columns 4 and 5 hold the coarse-grained and fine-grained part-of-speech tags, respectively. To select only adjectives, keep the words whose coarse tag in column 4 is JJ.
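As a rough illustration of that filtering step, here is a minimal sketch using only the standard library. The function names and the sample rows are mine, not part of MaltParser or any library, and it assumes the usual layout of one token per line with blank lines between sentences. CoNLL-X files are normally tab-separated; splitting on any whitespace handles both that and the space-aligned listing shown above.

```python
def read_conll_sentences(lines):
    """Yield each sentence as a list of rows; a row is a list of column values.

    Assumes one token per line and a blank line between sentences.
    """
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Split on any run of whitespace (tabs or spaces).
        sentence.append(line.split())
    if sentence:
        yield sentence


def select_by_coarse_tag(sentence, tag="JJ"):
    """Return the word forms (column 2) whose coarse tag (column 4) equals `tag`."""
    return [cols[1] for cols in sentence if len(cols) > 3 and cols[3] == tag]


# Three rows taken from the parser output above.
sample = """\
9   disregard   _   VB  VB  _   2   conj    _   _
10  other       _   JJ  JJ  _   11  amod    _   _
11  tags        _   NN  NNS _   9   dobj    _   _
"""

adjectives = [select_by_coarse_tag(s) for s in read_conll_sentences(sample.splitlines())]
print(adjectives)  # [['other']]
```

For a real file, you would pass the open file object instead of `sample.splitlines()`, since iterating over a file yields its lines.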

Once you have selected the words according to whatever your selection criteria are, you can proceed to construct the vectors in the usual way.
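Since you mention scikit-learn and TF-IDF, one possible way to build the vectors from pre-selected words is sketched below. The `docs` lists are made-up placeholders standing in for the adjectives you extracted from each document, and `identity` is a helper name of my own; the only assumption about scikit-learn is that `TfidfVectorizer` accepts a callable `analyzer`, which bypasses its built-in tokenization and lowercasing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: one list of already-selected adjectives per document.
docs = [
    ["other", "large"],
    ["large", "original"],
    ["other"],
]


def identity(tokens):
    """Return the token list unchanged, since selection has already happened."""
    return tokens


# A callable analyzer receives each document as-is and returns its features.
vectorizer = TfidfVectorizer(analyzer=identity)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # the distinct selected words
print(X.shape)                         # (number of documents, number of distinct words)
```

The resulting sparse matrix `X` can be fed to a classifier exactly like the one you get from a plain bag-of-words pipeline.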

P.S. I assumed your question was mainly about the CoNLL format, not about how to extract the adjectives (which can, of course, be done by splitting rows on tabs or by regex matching; there are several questions and answers on SO about Pythonic ways of doing that).

Upvotes: 3
