Reputation: 65
I have a large book stored in a single plain text file and want to parse it in order to create individual files for each chapter. I have a simple regex that finds each chapter title, but I'm struggling to capture all of the text in between.
import re
txt = open('book.txt', 'r')
for line in txt:
    if re.match("^[A-Z]+$", line):
        print line,
I know this is fairly rudimentary, but I'm new enough to python that it's got me a bit stumped. At the moment I'm going line by line, so my thought process is:
My attempts to actually write that out have been less successful though. Appreciate the help!
Edit: Specifically, I'm confused by the Python syntax for file I/O. I've tried:
for line in txt:
    if re.match("^[A-Z]+$", line):
        f = open(line + '.txt', 'w')
    else f.write(line + "\n")
as my general approach, but that's not gonna work as written. Hoping for help structuring the loops. Thanks
Upvotes: 3
Views: 3891
Reputation: 37
The full grammar of Python is here: https://docs.python.org/2/reference/grammar.html?highlight=grammar.
For a more readable explanation, see https://docs.python.org/2/reference/compound_stmts.html#the-if-statement, which covers compound statements (with, for and if) and their exact syntax.
Also, see https://docs.python.org/2/library/functions.html#open for the built-in function open().
Stay consistent with the indentation of code blocks, and remember that a : must end the header of every compound statement, before its suite.
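import re

def is_title(l):
    # A chapter title is a line made up only of capital letters.
    return re.match("^[A-Z]+$", l)

with open('book.txt', 'r') as corpus:
    eye = corpus.readlines()

verdad = None  # the chapter file currently open, if any
for line in eye:
    if is_title(line):
        if verdad: verdad.close()
        # Strip the trailing newline before building the file name.
        verdad = open(line.strip().replace(' ', '_') + '.txt', 'w')
    elif verdad:
        verdad.write(line)  # line already ends with "\n"
if verdad: verdad.close()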
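As a small illustration of the colon and indentation rule mentioned above, here is a sketch using a hypothetical helper, classify(), which is not part of the asker's code. It assumes a title is a line made up only of capital letters.

```python
import re

def classify(line):
    # Every header (def, if, else) ends with a colon,
    # and its suite is indented one level below it.
    if re.match("^[A-Z]+$", line):
        return "title"
    else:
        return "text"
```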
Upvotes: -1
Reputation: 5902
Perhaps you can also try the following:
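import re

with open('book.txt', 'r') as file:
    lines = file.read()

# The capturing group keeps each title in the result:
# [preamble, title1, body1, title2, body2, ...]
contents = re.split(r"^([A-Z]+)$", lines, flags=re.MULTILINE)
for i in range(1, len(contents), 2):
    with open(contents[i] + '.txt', 'w') as file:
        file.write(contents[i+1])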
The book contents are split by the chapter title. The resulting chapter contents (contents[i+1]) are then written to the chapter file (contents[i] + '.txt').
Edit: this assumes that you have a fixed pattern for the chapter titles.
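The splitting idea can be tried on a small in-memory string before touching any files. This is only a sketch: it assumes a title is a line consisting solely of capital letters, it uses a capturing group together with re.MULTILINE so that the titles are kept in the result, and the sample text is made up.

```python
import re

sample = "Preface text\nONE\nfirst body\nTWO\nsecond body\n"

# With a capturing group, re.split keeps the matched titles in the result.
parts = re.split(r"^([A-Z]+)$", sample, flags=re.MULTILINE)

# Pair each title (odd index) with the text that follows it (next index).
chapters = {parts[i]: parts[i + 1].strip() for i in range(1, len(parts), 2)}
```

Anything before the first title (here "Preface text") ends up in parts[0] and is not assigned to a chapter.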
Upvotes: 1
Reputation: 22312
I think this will work:
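import re

with open('book.txt', 'r') as file:
    txt = file.readlines()

f = False
for line in txt:
    if re.match("^[A-Z]+$", line):
        if f: f.close()
        # Strip the trailing newline so it does not end up in the file name.
        f = open(line.strip() + '.txt', 'w')
    elif f:
        # Text before the first title is skipped; line already ends with "\n".
        f.write(line)
if f: f.close()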
Maybe I should add some explanation:
with closes the file automatically when its block ends. Closing an opened file is important.
readlines() reads the file line by line and returns the lines as a list.
I start with f = False, so the first time through, if f: is False and f.close() does not run.
This is the important part: once a file f has been opened, if f: is True, and the file is closed by f.close().
Then the call to open() with mode 'w' opens a new file named after the title, and the following lines are written to it. When re.match("^[A-Z]+$", line) matches again, the current file is closed and another one is opened, and so on until the txt list is exhausted.
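Putting the steps above together, here is a minimal sketch written as a function. The name split_chapters and the out_dir parameter are hypothetical, introduced only for illustration, and the title format (a line of capital letters only) is the same assumption as in the question.

```python
import os
import re

TITLE = re.compile(r"^[A-Z]+$")  # assumed title format: capital letters only

def split_chapters(src_path, out_dir):
    """Write one file per chapter into out_dir; return the paths created."""
    created = []
    out = None  # file for the chapter currently being written
    with open(src_path, "r") as src:
        for line in src:
            if TITLE.match(line):
                if out:
                    out.close()
                # Strip the newline so it does not end up in the file name.
                path = os.path.join(out_dir, line.strip() + ".txt")
                out = open(path, "w")
                created.append(path)
            elif out:
                # Text before the first title is skipped.
                out.write(line)  # line already ends with "\n"
    if out:
        out.close()
    return created
```

Calling split_chapters('book.txt', '.') would write the chapter files into the current directory.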
Upvotes: 1