shikha singh
shikha singh

Reputation: 488

Pdf to XML/json using Python module

I am able to read the text from pdf using code:

import pdfx
pdf = pdfx.PDFx("1951.pdf")
metadata = pdf.get_metadata()
reference_list = pdf.get_references()
reference_dict = pdf.get_references_as_dict()
pdf.download_pdfs("D:/")
pdf.get_text()

But can't convert it into json:

pdfx -d D:/Output/ -j -o output.json pdf
SyntaxError: invalid syntax

Syntax: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf

Upvotes: 1

Views: 34536

Answers (1)

shikha singh
shikha singh

Reputation: 488

I was able to convert to XML using the Pdfminer Python module.

  1. Download and unzip the module from http://pypi.python.org/pypi/pdfminer/
  2. Run on shell: python pdf2txt.py -o samples/output.xml -t xml samples/1951.pdf
  3. For text : python pdf2txt.py samples/1951.pdf

Upvotes: 8

Related Questions