Elizabeth

Reputation: 71

How to convert word lists into sentence strings - Text Classification

I am currently working with the Brown Corpus and have run into a slight issue. In order to apply a tokenize feature, I first need to get the Brown Corpus into sentences. This is what I have so far:

from nltk.corpus import brown
import nltk


target_text = [s for s in brown.fileids()
                   if s.startswith('ca01') or s.startswith('ca02')]

data = []

total_text = [s for s in brown.fileids()
                   if s.startswith('ca01') or s.startswith('ca02') or s.startswith('cp01') or s.startswith('cp02')]


for text in total_text:

    if text in target_text:
        tag = "pos"
    else:
        tag = "neg"
    words = list(brown.sents(text))  # sentences of the current file only, not of every file in total_text
    data.extend( [(tag, word) for word in words] )

data

When I do this, I get data that looks like this:

[('pos',
  ['The',
   'Fulton',
   'County',
   'Grand',
   'Jury',
   'said',
   'Friday',
   'an',
   'investigation',
   'of',
   "Atlanta's",
   'recent',
   'primary',
   'election',
   'produced',
   '``',
   'no',
   'evidence',
   "''",
   'that',
   'any',
   'irregularities',
   'took',
   'place',
   '.']),
 ('pos',
  ['The',
   'jury',
   'further',
   'said',
   'in',
   'term-end',
   'presentments',
   'that',
   'the',
   'City',
   'Executive',
   'Committee',
   ',',
   'which',
   'had',
   'over-all',
   'charge',
   'of',
   'the',
   'election',
   ',',
   '``',
   'deserves',
   'the',
   'praise',
   'and',
   'thanks',
   'of',
   'the',
   'City',
   'of',
   'Atlanta',
   "''",
   'for',
   'the',
   'manner',
   'in',
   'which',
   'the',
   'election',
   'was',
   'conducted',
   '.'])

What I need is something that looks like:

[('pos', "The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election ...."), ('pos', 'The jury further said in term-end presentments that the City...')]

Is there any way to fix this? This project is taking way longer than I expected.

Upvotes: 1

Views: 1223

Answers (1)

aghast

Reputation: 15290

According to the docs, the .sents method returns a list (document) of lists (sentences) of strings (words) - you're not doing anything wrong in your call.
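To make that shape concrete, here is a small sketch using a hand-written stand-in for what .sents returns. The sample_sents name and its contents are made up for illustration (not real corpus output), so it runs without the corpus downloaded:

```python
# Stand-in for the return value of brown.sents(fileid): a list of
# sentences, each of which is a list of word strings.
# (sample_sents is illustrative data, not real corpus output.)
sample_sents = [
    ["The", "jury", "said", "so", "."],
    ["It", "was", "Friday", "."],
]

for sent in sample_sents:
    # Each element is one sentence, i.e. a list of str tokens.
    print(type(sent).__name__, len(sent), sent[0])
```

So each item you pair with a tag is already one sentence; it is just stored as a list of tokens rather than as a single string.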

If you want to reconstitute the sentences, you might try just joining them with a space. But this won't really work due to punctuation marks:

data.extend( [(tag, ' '.join(word)) for word in words] )

For example, you'll have tokens like these:

'the',
'election',
',',
'``',
'deserves',
'the',

which map to:

the election , `` deserves the

That happens because join doesn't know about punctuation. Does nltk include some kind of punctuation-aware formatter?
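NLTK does include a detokenizer, TreebankWordDetokenizer in nltk.tokenize.treebank, whose detokenize method takes a list of tokens. If you'd rather not depend on it, a rough hand-rolled version can be sketched as below. The join_tokens helper and its two punctuation sets are my own assumptions, not anything from NLTK, and they only cover the Brown-style tokens shown above:

```python
# Hypothetical helper: rejoin Brown-style tokens with sensible spacing.
# The two sets are assumptions covering only common punctuation.
NO_SPACE_BEFORE = {".", ",", ";", ":", "!", "?", "''", ")"}
NO_SPACE_AFTER = {"``", "("}


def join_tokens(tokens):
    """Join tokens, skipping the space around attached punctuation."""
    pieces = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else None
        # Add a separating space unless this token attaches to the
        # previous one, or the previous one attaches to this token.
        if (prev is not None
                and tok not in NO_SPACE_BEFORE
                and prev not in NO_SPACE_AFTER):
            pieces.append(" ")
        # Brown writes double quotes as `` and ''; map both to ".
        pieces.append('"' if tok in ("``", "''") else tok)
    return "".join(pieces)


tokens = ["the", "election", ",", "``", "deserves", "the",
          "praise", "''", "for", "it", "."]
print(join_tokens(tokens))  # the election, "deserves the praise" for it.
```

With something like this in place, the line in the question's loop would become data.extend( [(tag, join_tokens(word)) for word in words] ). Contractions and other edge cases are not handled, so treat it as a starting point.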

Upvotes: 1
