AynonT

Reputation: 325

How to parse the text of each chapter in an EPUB?

I am trying to parse books in EPUB format and convert their content to my own structure, but I am having trouble detecting and extracting all the text between chapter headings. How can I accomplish that?

Here are the two EPUB files I want it to work on, and eventually on others: http://www.gutenberg.org/ebooks/11.epub.noimages?session_id=f5b366deca86ee5e978d79f53f4fcaf1e0ac32ca

http://www.gutenberg.org/ebooks/98.epub.noimages?session_id=f5b366deca86ee5e978d79f53f4fcaf1e0ac32ca

I am able to get each chapter's title into a dictionary like so:

{'ALICE’S ADVENTURES IN WONDERLAND': [], 'THE MILLENNIUM FULCRUM EDITION 3.0': [], 'Contents': [], 'CHAPTER I. Down the Rabbit-Hole': [], 'CHAPTER II. The Pool of Tears': [], 'CHAPTER III. A Caucus-Race and a Long Tale': [], 'CHAPTER IV. The Rabbit Sends in a Little Bill': [], 'CHAPTER V. Advice from a Caterpillar': [], 'CHAPTER VI. Pig and Pepper': [], 'CHAPTER VII. A Mad Tea-Party': [], 'CHAPTER VIII. The Queen’s Croquet-Ground': [], 'CHAPTER IX. The Mock Turtle’s Story': [], 'CHAPTER X. The Lobster Quadrille': [], 'CHAPTER XI. Who Stole the Tarts?': [], 'CHAPTER XII. Alice’s Evidence': []}

I want to get the text that follows each chapter heading into the corresponding list, but I am having a lot of trouble.

Here is how I get the chapters:

import sys
import lxml
import ebooklib
from ebooklib import epub
from ebooklib.utils import debug
from lxml import etree
from io import StringIO, BytesIO
import csv, json

bookJSON = {}
chapterNav = {}
chapterTitle = {}
chapterCont = {}
def parseNAV(xml):
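    """
    Walk the navigation XML and map each entry's anchor id (the part of
    its content src after '#') to its label text.
    """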

    root = etree.fromstring(xml)

    for appt in root.getchildren():
        for elem in appt.getchildren():
            #print(elem.tag)
            for child in elem.getchildren():
                #print(child.tag)
                if("content" in child.tag):
                    srcTag = child.get("src")
                    #print(child.tag + " src: " + srcTag)
                    contentList = srcTag.split("#")
                    #print(contentList[1])
                    chapterNav[contentList[1]] = text
                    chapterTitle[text.strip()] = []
                    chapterCont[text.strip()] = []
                for node in child.getchildren():
                    if not node.text:
                        text = "None"
                    else:
                        text = node.text
                    #print(node.tag + " => " + text)
            #print(elem.tag + " CLOSED"  + "\n")

def parseContent(xml):
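    """
    Gather the text of one content document's elements, storing the list
    gathered so far in chapterCont whenever an element's text matches a
    chapter title.
    """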

    root = etree.fromstring(xml)
    chaptText = []
    chapter= ''
    for appt in root.getchildren():
        for elem in appt.getchildren():
            if(elem.text != None and stringify_children(elem) != None):
                if("h2" in elem.tag):
                    print(stringify_children(elem))
                if (elem.text).strip() in chapterTitle.keys():
                    chapterCont[elem.text.strip()] = chaptText
                    chaptText = []
                else:
                    chaptText.append(stringify_children(elem))
def stringify_children(node):
    return (''.join(node.itertext()).strip()).replace("H2 anchor","")

book = epub.read_epub(sys.argv[1])

# debug(book.metadata)

def getData(id,book,bookJSON):
    data = list(book.get_metadata('DC', id))
    if(len(data) != 0):
        bookJSON[id] = []
        for x in data:
            dataTuple = x
            bookJSON[id].append(str(dataTuple[0]))
        return bookJSON
    return bookJSON


bookJSON =  getData('title',book,bookJSON)
bookJSON = getData('creator',book,bookJSON)
bookJSON = getData('identifier',book,bookJSON)
bookJSON = getData('description',book,bookJSON)
bookJSON = getData('language',book,bookJSON)
bookJSON = getData('subject',book,bookJSON)
nav = list(book.get_items_of_type(ebooklib.ITEM_NAVIGATION))
navXml = etree.XML(nav[0].get_content())
#print(nav[0].get_content().decode("utf-8"))


parseNAV(etree.tostring(navXml))
print(bookJSON)

bookContent = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))
for cont in bookContent:
    contentXml = etree.XML(cont.get_content())

    parseContent(etree.tostring(contentXml))
# print(chapterCont)
# print(chapterNav)
# print(chapterTitle)

parseContent is the function I am trying to use. Currently it works for the first couple of chapters, then starts to fail miserably. I just want to get all the text from each chapter into its respective list. Thank you very much. I am going to keep working on it; any help or advice would be greatly appreciated.

Upvotes: 2

Views: 3810

Answers (1)

AynonT

Reputation: 325

I figured out a solution: I created an index of where the chapters start, using the chapter titles, and saved it in a tuple. I then used that tuple to iterate through the content and append all of the content to the respective chapters. I hope this helps the next person looking to parse EPUBs. If anyone has a better suggestion, please let me know; there is not much information about EPUB parsing online.
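
A minimal sketch of that approach follows. It is not the exact code I ended up with: the helper names (normalize, build_chapter_index, split_by_chapter) and the sample data are made up for illustration. It assumes the book's text has already been flattened into one list of strings in reading order, for example by calling stringify_children from the question on every element of every document, and that each chapter title appears as a block of its own where the chapter starts.

```python
def normalize(text):
    """Collapse whitespace so a heading split across lines still matches."""
    return " ".join(text.split())


def build_chapter_index(blocks, titles):
    """Return a tuple of (title, position) pairs, one per chapter found.

    Each title is searched for after the previous match, so the pairs
    come out in reading order. Titles that never match are left out.
    """
    index = []
    start = 0
    for title in titles:
        wanted = normalize(title)
        for pos in range(start, len(blocks)):
            if normalize(blocks[pos]) == wanted:
                index.append((title, pos))
                start = pos + 1
                break
    return tuple(index)


def split_by_chapter(blocks, index):
    """Map each indexed title to the blocks between it and the next title."""
    chapters = {}
    for i, (title, pos) in enumerate(index):
        # A chapter ends where the next one starts, or at the end of the book.
        end = index[i + 1][1] if i + 1 < len(index) else len(blocks)
        chapters[title] = [b for b in blocks[pos + 1:end] if b.strip()]
    return chapters


# Hypothetical sample data standing in for the flattened text of a book.
blocks = [
    "CHAPTER I.\nDown the Rabbit-Hole",
    "first paragraph of chapter one",
    "second paragraph of chapter one",
    "CHAPTER II. The Pool of Tears",
    "first paragraph of chapter two",
]
titles = ["CHAPTER I. Down the Rabbit-Hole", "CHAPTER II. The Pool of Tears"]

index = build_chapter_index(blocks, titles)
chapters = split_by_chapter(blocks, index)
print(index)
print(chapters)
```

One caveat with this sketch: if the table of contents repeats the titles as separate blocks, those entries would be matched first, so the search would probably need to be restricted to heading elements such as h2.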

Upvotes: 3
