Reputation: 799
I am trying to split up a text file into words, with \n being counted as a word.
My input is this text file:
War and Peace
by Leo Tolstoy/Tolstoi
And I want a list output like this:
['War','and','Peace','\n','\n','by','Leo','Tolstoy/Tolstoi']
Using .split() I get this:
['War', 'and', 'Peace\n\nby', 'Leo', 'Tolstoy/Tolstoi']
So I started writing a program to put the \n as a separate entry after the word, code following:
for oldword in text:
    counter = 0
    newword = oldword
    while "\n" in newword:
        newword = newword.replace("\n","",1)
        counter += 1
    text[text.index(oldword)] = newword
    while counter > 0:
        text.insert(text.index(newword)+1, "\n")
        counter -= 1
However, the program seems to hang on the line counter -= 1, and I can't for the life of me figure out why.
NOTE: I realise that were this to work, the result would be ['Peaceby',"\n","\n"]; that is a different problem to be solved later.
Upvotes: 4
Views: 7588
Reputation: 30258
Since you are reading a file, you can process it line by line, splitting one line at a time and handling the newlines appropriately:
>>> [word for line in inputFile for word in line.rstrip('\n').split() + ['\n']]
['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi', '\n']
The simple breakdown:
for line in inputFile: for each line in inputFile.
for word in line.rstrip('\n').split() + ['\n']: strip off the newline and split the line, adding the newline back on as a separate element.
As noted, if you use split() with no separator then you don't actually need the rstrip('\n').
You could use these exact expressions as a loop instead of a list comprehension:
result = []
for line in inputFile:
    for word in line.rstrip('\n').split():
        result.append(word)
    result.append('\n')
print(result)
Which gives the same output:
['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi', '\n']
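To try this without a file on disk, here is a minimal sketch that wraps the same comprehension in a function and feeds it an io.StringIO object in place of inputFile. The name words_with_newlines and the StringIO stand-in are assumptions made for illustration. It also relies on the point above that split() with no separator already discards the trailing newline.

```python
import io

def words_with_newlines(lines):
    """Split each line into words and add a newline entry after each line."""
    # split() with no argument splits on any whitespace and drops the
    # trailing newline, so no rstrip is needed here.
    return [word for line in lines for word in line.split() + ['\n']]

# io.StringIO stands in for the opened file (an assumption for this demo).
sample = io.StringIO("War and Peace\n\nby Leo Tolstoy/Tolstoi\n")
result = words_with_newlines(sample)
print(result)
```

Because the function accepts any iterable of lines, it should work the same way with a real file object opened via open().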
Upvotes: 0
Reputation: 85422
This is yet another variation:
words = []  # must exist before the loop extends it
with open('data.txt') as fobj:
    for line in fobj:
        words.extend(line.split())
        words.append('\n')
It splits into words on any whitespace, including tabs.
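Note that this loop appends '\n' after every line, including a last line that has no trailing newline in the file. If that extra entry is unwanted, one possible tweak is to check line.endswith('\n') before appending. A minimal sketch, with io.StringIO standing in for open('data.txt') and split_lines as a hypothetical helper name:

```python
import io

def split_lines(fobj):
    """Collect words line by line, keeping each real newline as an entry."""
    words = []
    for line in fobj:
        words.extend(line.split())
        # Only record a newline that is actually present in the input.
        if line.endswith('\n'):
            words.append('\n')
    return words

# Sample text without a trailing newline (stand-in for the real file).
fobj = io.StringIO("War and Peace\n\nby Leo Tolstoy/Tolstoi")
words = split_lines(fobj)
print(words)
```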
Upvotes: 0
Reputation: 107287
You don't need anything that complicated. You can simply use a regex with re.findall() to find all the words and newlines:
>>> import re
>>> s = """War and Peace
...
... by Leo Tolstoy/Tolstoi"""
>>>
>>> re.findall(r'\S+|\n', s)
['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi']
The pattern r'\S+|\n' matches either a run of one or more non-whitespace characters (\S+) or a newline (\n).
If you want to get the text from a file you can do the following:
with open('file_name') as f:
    words = re.findall(r'\S+|\n', f.read())  # keep the result
Read more about regular expressions at http://www.regular-expressions.info/
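As a self-contained sketch of the same idea, the pattern can be compiled once and wrapped in a small function. The names TOKEN and tokenize are made up for this example. The sample string includes a tab to show that other whitespace is skipped, while newlines are kept as their own entries.

```python
import re

# Either a run of non-whitespace characters or a single newline.
TOKEN = re.compile(r'\S+|\n')

def tokenize(text):
    """Return words and newlines from text as separate list entries."""
    return TOKEN.findall(text)

tokens = tokenize("War and Peace\n\nby\tLeo Tolstoy/Tolstoi")
print(tokens)
```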
Upvotes: 7
Reputation: 6360
In order to get rid of both \n characters and split by spaces so that each element of the list is a different word, you can first replace each \n\n with a single space, string.replace('\n\n', ' '), assign the result to a new string, and then split by spaces: newString.split(' ')
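A minimal sketch of these two steps on the sample text from the question (the variable names are illustrative). Note that this drops the newlines rather than keeping them as list entries, and it assumes line breaks always appear as exactly two consecutive \n characters; a single \n would stay inside a word.

```python
text = "War and Peace\n\nby Leo Tolstoy/Tolstoi"

# Step 1: turn each double newline into a single space.
new_string = text.replace('\n\n', ' ')

# Step 2: split on single spaces.
words = new_string.split(' ')
print(words)
```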
Upvotes: 0