Reputation: 77
I'm trying to parse my PDF files, and one way to do that is to convert them to HTML and extract the headings along with their paragraphs. So I tried pdf2htmlEX, and it converted my PDF to HTML without disturbing the layout. So far I was happy. This is how I ran the conversion:
>> import subprocess
>> path = "/home/administrator/Documents/pdf_file.pdf"
>> subprocess.call(["pdf2htmlEX" , path])
But when I opened the HTML file, it contained a lot of unnecessary markup along with my text and, more importantly, my text has no heading tags, just a bunch of divs and spans.
>> f = open('/home/administrator/Documents/pdf_file.html','r')
>> f = f.read()
>> print f
I even tried to access it using BeautifulSoup
>> from bs4 import BeautifulSoup as bs
>> soup = bs(f)
>> soup.find('div', attrs={'class': 'site-content'}).h1
It didn't give me anything because there were no such tags. I have also tried HTMLParser:
from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class myhtmlparser(HTMLParser):
    def __init__(self):
        self.reset()
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

    def handle_starttag(self, tag, attrs):
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)

    def handle_data(self, data):
        self.HTMLDATA.append(data)

    def clean(self):
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

parser = myhtmlparser()
parser.feed(f)

# Extract data from parser
tags = parser.NEWTAGS
attrs = parser.NEWATTRS
data = parser.HTMLDATA

# Clean the parser
parser.clean()

# Print out our data
#print tags
print data
But none of these does what I need. All I want is to extract each heading along with its paragraphs from that HTML file. Is that too much to ask? :p I have searched almost every site and read almost everything on this, but all my effort has been in vain. Please guide me on this.
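To make the goal concrete, here is a rough sketch of the kind of grouping being asked for. It is not a confirmed solution. It assumes Python 3 and that the generated HTML puts each text line in a `div` carrying the class `t` plus a font-size class such as `fs0` or `fs1`; that is an assumption about pdf2htmlEX's output and should be checked against the actual file. Which `fs` class counts as a heading is also a guess to be filled in after looking at the generated CSS. The names `TextLineCollector` and `group_under_headings` are made up for illustration.

```python
from html.parser import HTMLParser


class TextLineCollector(HTMLParser):
    """Collect (font_size_class, text) pairs, one per text-line div."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished (fs_class, text) pairs
        self._depth = 0    # div nesting depth inside the current text line
        self._fs = None    # font-size class of the current text line
        self._parts = []   # text fragments of the current text line

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            # Already inside a text line: only track nested divs.
            if tag == "div":
                self._depth += 1
        elif tag == "div" and "t" in classes:
            # Assumption: class "t" marks a text line, "fsN" its font size.
            self._depth = 1
            self._fs = next((c for c in classes if c.startswith("fs")), None)
            self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._parts).strip()
                if text:
                    self.lines.append((self._fs, text))


def group_under_headings(lines, heading_classes):
    """Start a new section at each heading line; attach following lines to it."""
    sections = []
    for fs, text in lines:
        if fs in heading_classes:
            sections.append((text, []))
        elif sections:
            sections[-1][1].append(text)
    return sections


# Hand-written sample in the assumed shape, not real pdf2htmlEX output.
sample = (
    '<div class="t m0 x1 h2 y1 ff1 fs0 fc0 sc0 ls0 ws0">Introduction</div>'
    '<div class="t m0 x1 h3 y2 ff2 fs1 fc0 sc0 ls0 ws0">'
    'First <span class="_ _0"></span>line.</div>'
    '<div class="t m0 x1 h3 y3 ff2 fs1 fc0 sc0 ls0 ws0">Second line.</div>'
)
collector = TextLineCollector()
collector.feed(sample)
collector.close()
sections = group_under_headings(collector.lines, {"fs0"})
```

Font size alone is a crude signal: body text and headings can share a size, so font family or position may need to be taken into account as well.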
Upvotes: 3
Views: 30344
Reputation: 11
If it's Python 3 or later, it should be:
outputFilename = outputDir + filename.replace(".pdf",".html")
subprocess.run(["pdf2htmlEX",file,outputFilename])
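A slightly fuller sketch of the same call, assuming Python 3.5+ (where `subprocess.run` is available) and that `pdf2htmlEX` is on the PATH. The helper names `build_command` and `convert` and the `/tmp/out` directory are made up for illustration. `--dest-dir` is the pdf2htmlEX option that selects the output directory, and the output file name is given after the input PDF.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, output_dir):
    """Return the pdf2htmlEX argument list for converting one PDF."""
    pdf_path = Path(pdf_path)
    # Same base name as the PDF, with an .html extension.
    output_name = pdf_path.with_suffix(".html").name
    return [
        "pdf2htmlEX",
        "--dest-dir", str(output_dir),  # directory the HTML is written to
        str(pdf_path),                  # input PDF
        output_name,                    # output file name inside --dest-dir
    ]


def convert(pdf_path, output_dir):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(pdf_path, output_dir), check=True)


command = build_command("/home/administrator/Documents/pdf_file.pdf", "/tmp/out")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.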
Upvotes: 1