Reputation: 325
I am trying to parse books in EPUB format and convert their content to my own structure, but I am having trouble detecting and extracting all the text between chapter headings. How can I accomplish that?
Here are the two EPUB files I want it to work on (and eventually on others): http://www.gutenberg.org/ebooks/11.epub.noimages?session_id=f5b366deca86ee5e978d79f53f4fcaf1e0ac32ca
http://www.gutenberg.org/ebooks/98.epub.noimages?session_id=f5b366deca86ee5e978d79f53f4fcaf1e0ac32ca
I am able to get each chapter's title into a dictionary like so:
{'ALICE’S ADVENTURES IN WONDERLAND': [], 'THE MILLENNIUM FULCRUM EDITION 3.0': [], 'Contents': [], 'CHAPTER I. Down the Rabbit-Hole': [], 'CHAPTER II. The Pool of Tears': [], 'CHAPTER III. A Caucus-Race and a Long Tale': [], 'CHAPTER IV. The Rabbit Sends in a Little Bill': [], 'CHAPTER V. Advice from a Caterpillar': [], 'CHAPTER VI. Pig and Pepper': [], 'CHAPTER VII. A Mad Tea-Party': [], 'CHAPTER VIII. The Queen’s Croquet-Ground': [], 'CHAPTER IX. The Mock Turtle’s Story': [], 'CHAPTER X. The Lobster Quadrille': [], 'CHAPTER XI. Who Stole the Tarts?': [], 'CHAPTER XII. Alice’s Evidence': []}
I want to get the text between each chapter heading into that chapter's list, but I am having a lot of trouble.
Here is how I get the chapters:
import sys
import lxml
import ebooklib
from ebooklib import epub
from ebooklib.utils import debug
from lxml import etree
from io import StringIO, BytesIO
import csv, json
bookJSON = {}
chapterNav = {}
chapterTitle = {}
chapterCont = {}
def parseNAV(xml):
    """
    Parse the navigation (NCX) XML and record each chapter's anchor id and title.
    """
    root = etree.fromstring(xml)
    for appt in root.getchildren():
        for elem in appt.getchildren():
            #print(elem.tag)
            for child in elem.getchildren():
                #print(child.tag)
                if("content" in child.tag):
                    # `text` was set by the navLabel child that precedes <content>
                    srcTag = child.get("src")
                    #print(child.tag + " src: " + srcTag)
                    contentList = srcTag.split("#")
                    #print(contentList[1])
                    chapterNav[contentList[1]] = text
                    chapterTitle[text.strip()] = []
                    chapterCont[text.strip()] = []
                for node in child.getchildren():
                    if not node.text:
                        text = "None"
                    else:
                        text = node.text
                    #print(node.tag + " => " + text)
            #print(elem.tag + " CLOSED" + "\n")
def parseContent(xml):
    """
    Parse the xml
    """
    root = etree.fromstring(xml)
    chaptText = []
    chapter= ''
    for appt in root.getchildren():
        for elem in appt.getchildren():
            if(elem.text != None and stringify_children(elem) != None):
                if("h2" in elem.tag):
                    print(stringify_children(elem))
                    if (elem.text).strip() in chapterTitle.keys():
                        chapterCont[elem.text.strip()] = chaptText
                        chaptText = []
                else:
                    chaptText.append(stringify_children(elem))
def stringify_children(node):
    return (''.join(node.itertext()).strip()).replace("H2 anchor","")
book = epub.read_epub(sys.argv[1])
# debug(book.metadata)
def getData(id,book,bookJSON):
    data = list(book.get_metadata('DC', id))
    if(len(data) != 0):
        bookJSON[id] = []
        for x in data:
            dataTuple = x
            bookJSON[id].append(str(dataTuple[0]))
        return bookJSON
    return bookJSON
bookJSON = getData('title',book,bookJSON)
bookJSON = getData('creator',book,bookJSON)
bookJSON = getData('identifier',book,bookJSON)
bookJSON = getData('description',book,bookJSON)
bookJSON = getData('language',book,bookJSON)
bookJSON = getData('subject',book,bookJSON)
nav = list(book.get_items_of_type(ebooklib.ITEM_NAVIGATION))
navXml = etree.XML(nav[0].get_content())
#print(nav[0].get_content().decode("utf-8"))
parseNAV(etree.tostring(navXml))
print(bookJSON)
bookContent = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))
for cont in bookContent:
    contentXml = etree.XML(cont.get_content())
    parseContent(etree.tostring(contentXml))
# print(chapterCont)
# print(chapterNav)
# print(chapterTitle)
parseContent is the function I am trying to use. Currently it works for the first couple of chapters, then starts to fail miserably. I just want to get all the text from each chapter into the respective lists. Thank you very much. I am going to keep working on it; if you can offer any help or advice, it would be greatly appreciated.
Upvotes: 2
Views: 3810
Reputation: 325
I figured out a solution: I created an index, based on the chapter titles, of where each chapter starts and saved it in a tuple. I then used that tuple to iterate through the content and append all of the content to the respective chapters. Hope this helps the next person looking to parse EPUBs. If anyone has a better suggestion, please let me know. There is not much information about EPUB parsing online.
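For anyone who wants to try the same idea, here is a minimal sketch of the index approach. It is not the exact code from my script. It assumes the content of all documents has already been flattened into one list of (tag, text) pairs in reading order; that flattened form, the function name split_by_chapter and the heading_tag parameter are all made up for illustration.

```python
def split_by_chapter(blocks, titles, heading_tag="h2"):
    """Group text blocks under the chapter heading that precedes them.

    blocks: (tag, text) pairs in reading order (hypothetical flattened form).
    titles: chapter titles taken from the navigation file.
    """
    wanted = set(titles)
    # Index of chapter starts: (position in blocks, title), saved as a tuple.
    starts = tuple(
        (pos, text)
        for pos, (tag, text) in enumerate(blocks)
        if tag == heading_tag and text in wanted
    )
    chapters = {title: [] for title in titles}
    for n, (pos, title) in enumerate(starts):
        # A chapter runs up to the next chapter start, or to the end of the book.
        end = starts[n + 1][0] if n + 1 < len(starts) else len(blocks)
        chapters[title] = [text for _, text in blocks[pos + 1:end]]
    return chapters


# Placeholder data standing in for the flattened book content.
blocks = [
    ("h1", "Book title"),
    ("p", "CHAPTER I. Down the Rabbit-Hole"),  # contents entry, not a heading
    ("h2", "CHAPTER I. Down the Rabbit-Hole"),
    ("p", "first paragraph"),
    ("p", "second paragraph"),
    ("h2", "CHAPTER II. The Pool of Tears"),
    ("p", "third paragraph"),
]
titles = ["CHAPTER I. Down the Rabbit-Hole", "CHAPTER II. The Pool of Tears"]
chapters = split_by_chapter(blocks, titles)
print(chapters)
```

Matching only on heading tags keeps the table-of-contents entries, which repeat the chapter titles, from being mistaken for chapter starts. Building the list across all content documents before splitting also means a chapter is not cut off where one file ends and the next begins.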
Upvotes: 3