akshit bhatia

Reputation: 395

Preprocessing the data from a text file with a split method

Below is a sample of my text. I want to load this text into a Python list. I first split the text using '<EOS>' as the delimiter, and then append each element of the result of the split method to the list.

The problem is that the split method appears to split the text on both '\n' and '<EOS>'. As a result, each list entry holds a single line rather than the complete document.

Please look at the code that follows the sample text and let me know what I am doing wrong.

Old Major, the old boar on the Manor Farm, summons the animals on the farm together for a meeting, during which he refers to humans as "enemies" and teaches the animals a revolutionary song called "Beasts of England".
When Major dies, two young pigs, Snowball and Napoleon, assume command and consider it a duty to prepare for the Rebellion.<EOS>
Alex is a 15-year-old living in near-future dystopian England who leads his gang on a night of opportunistic, random "ultra-violence".
Alex's friends ("droogs" in the novel's Anglo-Russian slang, 'Nadsat') are Dim, a slow-witted bruiser who is the gang's muscle; Georgie, an ambitious second-in-command; and Pete, who mostly plays along as the droogs indulge their taste for ultra-violence.
Characterised as a sociopath and a hardened juvenile delinquent, Alex also displays intelligence, quick wit, and a predilection for classical music; he is particularly fond of Beethoven, referred to as "Lovely Ludwig Van".

Python code to read the documents into a list:

f=open('./plots')
documents=[]
for x in f:
    documents.append(x.split('<EOS>'))
print documents[0]

# documents[0] should start at 'Old Major' and end at 'Rebellion'.

Upvotes: 1

Views: 553

Answers (3)

martineau

Reputation: 123541

split() is not splitting the text on both '\n' and '<EOS>'; it only splits on the latter. However, the for x in f: loop effectively splits the file's contents by newlines (\n).

Here's code largely equivalent to yours that better illustrates what's going on:

with open('./plots') as f:
    documents=[]
    for x in f:
        documents.append(x.split('<EOS>'))

for i, document in enumerate(documents):
    print('documents[{}]: {!r}'.format(i, document))
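
To try this without the original file, here is a minimal sketch that swaps in an in-memory io.StringIO object for './plots'. The shortened sample text and the helper name split_per_line are made up for illustration, and the code assumes Python 3.

```python
import io

# Hypothetical stand-in for './plots': the first document spans two lines.
sample = (
    "Old Major summons the animals.\n"
    "When Major dies, two pigs assume command.<EOS>\n"
    "Alex is a 15-year-old.\n"
)

def split_per_line(f):
    """Mimic the question's loop: one split() result per line of the file."""
    return [line.split('<EOS>') for line in f]

per_line = split_per_line(io.StringIO(sample))
for i, parts in enumerate(per_line):
    print('documents[{}]: {!r}'.format(i, parts))
```

Each entry of per_line is the split() result for one line, so the two-line first document ends up spread over two entries instead of one.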

Upvotes: 1

Ouroborus

Reputation: 16894

split('<EOS>') is only splitting on <EOS> as you expect. However, for x in f: works line-by-line and so is effectively performing an implicit split on your file.

Instead, maybe do something like this:

# Read the whole file at once so that only '<EOS>' acts as a delimiter.
with open('./plots') as f:
    documents = f.read().split('<EOS>')
print(documents[0])
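
One side effect of splitting the whole text is that the newline following each '<EOS>' stays attached to the next piece, and a trailing '<EOS>' leaves an empty piece at the end. A possible cleanup, as a sketch only: read_documents is a hypothetical helper, the sample text is shortened, and io.StringIO stands in for the real file (Python 3 assumed).

```python
import io

def read_documents(f, delimiter='<EOS>'):
    """Read the whole file, split on the delimiter, and tidy each piece."""
    pieces = f.read().split(delimiter)
    # strip() removes the newline left over after each delimiter;
    # the filter drops pieces that are empty once stripped.
    return [p.strip() for p in pieces if p.strip()]

# Hypothetical stand-in for './plots': two documents of two lines each.
sample = (
    "Old Major summons the animals.\n"
    "When Major dies, two pigs assume command.<EOS>\n"
    "Alex is a 15-year-old.\n"
    "Alex also displays intelligence.<EOS>\n"
)

documents = read_documents(io.StringIO(sample))
print(documents[0])
```

Newlines inside a document are kept, so documents[0] still contains both of its lines.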

Upvotes: 1

ack_inc

Reputation: 1113

Looping over f causes the file contents to be split by newline. Use this instead:

# read() returns the entire file as one string, newlines included.
with open('./plots') as f:
    documents = f.read().split('<EOS>')
print(documents[0])

Upvotes: 1
