Christopher Riches

Reputation: 799

Python splitting text file keeping newlines

I am trying to split up a text file into words, with \n being counted as a word.

My input is this text file:

War and Peace

by Leo Tolstoy/Tolstoi

And I want a list output like this:

['War','and','Peace','\n','\n','by','Leo','Tolstoy/Tolstoi']

Using .split() I get this:

['War', 'and', 'Peace\n\nby', 'Leo', 'Tolstoy/Tolstoi']

So I started writing a program to put the \n as a separate entry after the word, code following:

for oldword in text:
    counter = 0
    newword = oldword
    while "\n" in newword:
        newword = newword.replace("\n","",1)
        counter += 1

    text[text.index(oldword)] = newword

    while counter > 0:
        text.insert(text.index(newword)+1, "\n")
        counter -= 1

However, the program seems to hang on the line counter -= 1, and I can't for the life of me figure out why.

NOTE: I realise that were this to work, the result would be ['Peaceby',"\n","\n"]; that is a different problem to be solved later.

Upvotes: 4

Views: 7588

Answers (4)

AChampion

Reputation: 30258

Since you are reading a file, you can process it line by line, splitting one line at a time and handling the newlines appropriately:

>>> [word for line in inputFile for word in line.rstrip('\n').split() + ['\n']]
['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi', '\n']

The simple breakdown:

  • for line in inputFile: For each line in the inputFile
  • for word in line.rstrip('\n').split() + ['\n']: Strip off the newline and split the line, adding the newline back on as a separate element

As noted, if you use split() with no separator, you don't actually need the rstrip('\n').

You could use these exact expressions as a loop instead of a list comprehension:

result = []
for line in inputFile:
    for word in line.rstrip('\n').split():
        result.append(word)
    result.append('\n')
print(result)

This gives the same output:

['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi', '\n']
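
The point about rstrip('\n') being optional can be checked by running both variants on the same input. This is only a minimal sketch: io.StringIO stands in for the opened inputFile, and the names sample and split_keep_newlines are made up for illustration.

```python
import io

# Sample text from the question; StringIO stands in for the opened file.
sample = "War and Peace\n\nby Leo Tolstoy/Tolstoi\n"

def split_keep_newlines(lines, strip_first):
    """Split each line into words and append a newline entry after every line."""
    result = []
    for line in lines:
        if strip_first:
            line = line.rstrip('\n')
        # split() with no separator discards all surrounding whitespace,
        # including a trailing newline.
        result.extend(line.split())
        result.append('\n')
    return result

with_rstrip = split_keep_newlines(io.StringIO(sample), strip_first=True)
without_rstrip = split_keep_newlines(io.StringIO(sample), strip_first=False)
print(with_rstrip == without_rstrip)  # True
```

Note that a '\n' entry is appended after every line, including the last one, which is why the output above ends with '\n'.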

Upvotes: 0

Mike Müller

Reputation: 85422

This is yet another variation:

words = []  # collects the words and the '\n' entries
with open('data.txt') as fobj:
    for line in fobj:
        words.extend(line.split())
        words.append('\n')

line.split() splits on any run of whitespace, including tabs.
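
To see the whitespace handling in isolation, here is a small sketch on a hypothetical line that mixes a tab with repeated spaces (the sample string is an assumption, not taken from the question's file). It also shows, for contrast, what an explicit split(' ') would return.

```python
# Hypothetical line mixing a tab and repeated spaces.
line = "War\tand   Peace\n"

words = []
words.extend(line.split())  # no separator: split on any run of whitespace
words.append('\n')          # put the newline back as its own entry
print(words)                # ['War', 'and', 'Peace', '\n']

# With an explicit ' ' separator, the tab and newline stay attached and
# consecutive spaces produce empty strings.
by_space = line.split(' ')
print(by_space)             # ['War\tand', '', '', 'Peace\n']
```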

Upvotes: 0

Kasravnd

Reputation: 107287

You don't need anything that complicated. You can simply use a regex with re.findall() to find all the words and newlines:

>>> import re
>>> s="""War and Peace
... 
... by Leo Tolstoy/Tolstoi"""
>>> 
>>> re.findall(r'\S+|\n',s)
['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi']

'\S+|\n' matches either a run of one or more non-whitespace characters (\S+) or a single newline (\n).
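
As a small illustration of how the two alternatives behave, the sketch below runs the same pattern on a made-up string containing repeated spaces and a tab. Spaces and tabs match neither alternative, so they are skipped, while every newline becomes its own entry.

```python
import re

# Made-up input: two spaces, a tab, and two consecutive newlines.
text = "War  and\tPeace\n\nby Leo"

# \S+ grabs each run of non-whitespace; \n grabs each newline on its own.
tokens = re.findall(r'\S+|\n', text)
print(tokens)  # ['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo']
```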

If you want to get the text from a file you can do the following:

import re

with open('file_name') as f:
    words = re.findall(r'\S+|\n', f.read())

Read more about regular expressions at http://www.regular-expressions.info/

Upvotes: 7

m_callens

Reputation: 6360

To get rid of both \n characters and split on spaces so that each list element is a separate word, first replace each \n\n with a single space and assign the result to a new string, string.replace('\n\n', ' '), then split on spaces with newString.split(' ').
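
A minimal sketch of those two steps, applied to the text from the question (the variable names are illustrative). Note that this approach drops the newlines instead of keeping them as list entries, and it only handles newlines that come in pairs.

```python
# Text from the question, with the blank line as two consecutive newlines.
text = "War and Peace\n\nby Leo Tolstoy/Tolstoi"

new_string = text.replace('\n\n', ' ')  # collapse the blank line into one space
words = new_string.split(' ')           # split on single spaces
print(words)  # ['War', 'and', 'Peace', 'by', 'Leo', 'Tolstoy/Tolstoi']
```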

Upvotes: 0
