gweintraub

Reputation: 65

Parsing a book into chapters – Python

I have a large book stored in a single plain text file and want to parse it in order to create an individual file for each chapter. I have a simple regex that finds each chapter title, but I'm struggling to capture all of the text in between.

import re

txt = open('book.txt', 'r')

for line in txt :
    if re.match("^[A-Z]+$", line):
        print line,

I know this is fairly rudimentary, but I'm new enough to Python that it's got me a bit stumped. At the moment I'm going line by line, so my thought process is:

  1. If the line is a chapter title: Make a new file 'chapter_title.txt'
  2. If the next line isn't a chapter title: Write the line to chapter_title.txt

My attempts to actually write that out have been less successful though. Appreciate the help!

Edit: Specifically, I'm confused by the Python syntax for file I/O. I've tried:

for line in txt :
    if re.match("^[A-Z]+$", line):
        f = open(line + '.txt', 'w')
    else f.write(line + "\n")

as my general approach, but that's not gonna work as written. Hoping for help structuring the loops. Thanks

Upvotes: 3

Views: 3891

Answers (3)

Professor Hex

Reputation: 37

You asked for help with syntax.

The full grammar of Python is here: https://docs.python.org/2/reference/grammar.html?highlight=grammar.

For a more readable description of compound statements (with, for and if), see https://docs.python.org/2/reference/compound_stmts.html#the-if-statement.

Also see https://docs.python.org/2/library/functions.html#open for the built-in function open().

Keep the indentation of code blocks consistent, and remember that a colon must end the header of every compound statement, before its suite. That includes else, which is what is missing in your attempt.
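As a small illustration of the colon rule, here is a sketch of the if/else shape from the question, with every header ending in a colon and its suite indented underneath. The function name classify and the two sample lines are made up for the example.

```python
import re

def classify(line):
    # Each compound-statement header (def, if, else) ends with a colon,
    # and its suite is indented on the lines that follow.
    if re.match("^[A-Z]+$", line):
        return "title"
    else:
        return "body"

# Hypothetical sample lines, each ending with a newline as when read from a file
labels = [classify(s) for s in ["PROLOGUE\n", "It was a dark night.\n"]]
```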

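import re

with open('book.txt', 'r') as corpus:
    eye = corpus.readlines()

verdad = None  # the chapter file currently open, if any
l = lambda s: re.match("^[A-Z]+$", s)  # true for a chapter-title line

for line in eye:
    if l(line):
        # a new title: close the previous chapter and start the next file
        if verdad: verdad.close()
        verdad = open(line.strip().replace(' ', '_') + '.txt', 'w')
    elif verdad:
        # body text; each line already ends with a newline
        verdad.write(line)

if verdad: verdad.close()  # close the last chapter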

Upvotes: -1

MervS

Reputation: 5902

Perhaps you can also try the following:

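import re

with open('book.txt', 'r') as f:
    text = f.read()

# The capturing group keeps each title in the result;
# re.MULTILINE lets ^ and $ match at every line boundary.
contents = re.split(r"^([A-Z]+)$", text, flags=re.MULTILINE)

# contents[0] is any text before the first title
for i in range(1, len(contents), 2):
    with open(contents[i] + '.txt', 'w') as f:
        f.write(contents[i+1])

The book is split on the chapter titles. Because the pattern contains a capturing group, re.split keeps the titles in the result, so each title (contents[i]) is followed by its chapter text (contents[i+1]). The chapter text is then written to the chapter file (contents[i] + '.txt').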

Edit: this assumes that you have a fixed pattern for the chapter titles.
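To see what re.split returns when the pattern contains a capturing group, here is a small sketch run on a made-up string rather than a real book. It assumes every chapter title is a whole line of capital letters. The names sample, parts and chapters are only for illustration.

```python
import re

# Made-up sample text; chapter titles are lines of capital letters only.
sample = "preface text\nONE\nfirst body\nTWO\nsecond body\n"

# The parentheses make re.split keep each matched title in the result.
parts = re.split(r"^([A-Z]+)$", sample, flags=re.MULTILINE)

# parts[0] is whatever precedes the first title; after that, titles and
# chapter bodies alternate, so they can be paired up by index.
chapters = {parts[i]: parts[i + 1] for i in range(1, len(parts), 2)}
```

Each chapter body starts with the newline that followed its title, so you may want to strip it before writing.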

Upvotes: 1

Remi Guan

Reputation: 22312

I think this will work:

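import re

with open('book.txt', 'r') as file:
    txt = file.readlines()

f = False  # no chapter file is open yet

for line in txt:
    if re.match("^[A-Z]+$", line):
        if f: f.close()
        # strip the trailing newline so it does not end up in the file name
        f = open(line.strip() + '.txt', 'w')

    elif f:
        # each line already ends with a newline; text before the first title is skipped
        f.write(line)

if f: f.close()  # close the last chapter file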

Some explanation:

  1. with closes the file automatically when its block ends. Closing an opened file is important.

  2. readlines() reads the whole file and returns its lines as a list.

  3. f starts out as False, so the first time if f: is evaluated, nothing is closed.

This part matters: once a chapter file has been opened, if f: is true, so when the next title is found the current file is closed with f.close() (the first time through, f.close() does not run).

After that, open(..., 'w') opens a new file named after the title, and the lines that follow are written to it. Whenever re.match("^[A-Z]+$", line) succeeds, the current file is closed and another one is opened, and so on until the txt list is exhausted.
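The loop described above can also be wrapped in a function, which makes it easy to try on a few lines before running it on the whole book. This is only a sketch: split_book, out_dir and the sample lines are made-up names, and it assumes titles are whole lines of capital letters.

```python
import os
import re
import tempfile

def split_book(lines, out_dir):
    """Write each chapter to out_dir/<TITLE>.txt and return the titles found."""
    titles = []
    f = None  # the chapter file currently being written, if any
    try:
        for line in lines:
            if re.match("^[A-Z]+$", line):
                if f:
                    f.close()  # finish the previous chapter
                title = line.strip()  # drop the trailing newline
                titles.append(title)
                f = open(os.path.join(out_dir, title + ".txt"), "w")
            elif f:
                f.write(line)  # lines before the first title are skipped
    finally:
        if f:
            f.close()  # close the last chapter, even if an error occurred
    return titles

# Made-up input standing in for the lines of book.txt
book = ["intro\n", "ONE\n", "first body\n", "TWO\n", "second body\n"]
out_dir = tempfile.mkdtemp()
found = split_book(book, out_dir)
```

For the real file you would pass the open file object itself, since iterating over it yields one line at a time.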

Upvotes: 1
