
Reputation: 2863

PDF to XML conversion using .NET

I am currently building a .NET application, and one of the requirements is that it must convert a PDF file to an XML file. Has anyone had success doing this? If so, what did you use?

Upvotes: 6

Views: 48433

Answers (4)

Joris Schellekens

Reputation: 9057

Have a look at pdf2Data.

It converts pdf files to XML files based on a template. Templates are defined using selectors that allow the end-user to specify things like "select the table on the 2nd page" or "select the text written in this particular font" and so on.

Keep in mind, I am affiliated with iText so even though my knowledge of PDF is extensive, I may be considered biased towards iText products (seeing as I help develop them).

Upvotes: 1

Carls Jr.

Reputation: 3078

I have done this kind of project many times before.

Things you need to do:

1.) Check out the project "Extract Text from PDF in C#". It uses iTextSharp.

  • It is best to download the sample project and look at how it works. It shows how to extract data from a PDF. Look at the PDFParser class: its ExtractTextFromPDFBytes(byte[] input) method shows how the text is extracted from the uncompressed PDF content. Don't forget to reference the iTextSharp DLL.

PDFParser class

    using System;
    using System.IO;
    using iTextSharp.text.pdf;

    namespace PdfToText
    {
        /// <summary>
        /// Parses a PDF file and extracts the text from it.
        /// </summary>
        public class PDFParser
        {
            /// BT = Beginning of a text object operator
            /// ET = End of a text object operator
            /// Td move to the start of next line
            ///  5 Ts = superscript
            /// -5 Ts = subscript

            #region Fields

            #region _numberOfCharsToKeep
            /// <summary>
            /// The number of characters to keep, when extracting text.
            /// </summary>
            private static int _numberOfCharsToKeep = 15;
            #endregion

            #endregion

            #region ExtractText
            /// <summary>
            /// Extracts the text from a PDF file and writes it to a text file.
            /// </summary>
            /// <param name="inFileName">the full path to the pdf file.</param>
            /// <param name="outFileName">the output file name.</param>
            /// <returns>true if the text was extracted and written, false otherwise</returns>
            public bool ExtractText(string inFileName, string outFileName)
            {
                StreamWriter outFile = null;
                try
                {
                    // Create a reader for the given PDF file
                    PdfReader reader = new PdfReader(inFileName);
                    outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);

                    Console.Write("Processing: ");

                    // Width of the console progress bar, in '#' characters
                    int   totalLen     = 68;
                    float charUnit     = ((float)totalLen) / (float)reader.NumberOfPages;
                    int   totalWritten = 0;
                    float curUnit      = 0;

                    for (int page = 1; page <= reader.NumberOfPages; page++)
                    {
                        outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");

                        // Write the progress
                        if (charUnit >= 1.0f)
                        {
                            for (int i = 0; i < (int)charUnit; i++)
                            {
                                Console.Write("#");
                                totalWritten++;
                            }
                        }
                        else
                        {
                            // Fewer than one '#' per page: accumulate until a whole one is due
                            curUnit += charUnit;
                            if (curUnit >= 1.0f)
                            {
                                for (int i = 0; i < (int)curUnit; i++)
                                {
                                    Console.Write("#");
                                    totalWritten++;
                                }
                                curUnit = 0;
                            }
                        }
                    }

                    // Fill up whatever is left of the progress bar
                    for (int i = totalWritten; i < totalLen; i++)
                    {
                        Console.Write("#");
                    }
                    return true;
                }
                catch
                {
                    return false;
                }
                finally
                {
                    if (outFile != null) outFile.Close();
                }
            }
            #endregion

            #region ExtractTextFromPDFBytes
            /// <summary>
            /// This method processes an uncompressed Adobe (text) object
            /// and extracts text.
            /// </summary>
            /// <param name="input">uncompressed</param>
            /// <returns></returns>
            private string ExtractTextFromPDFBytes(byte[] input)
            {
                if (input == null || input.Length == 0) return "";

                try
                {
                    string resultString = "";

                    // Flag showing if we are currently inside a text object
                    bool inTextObject = false;

                    // Flag showing if the next character is literal
                    // e.g. '\\' to get a '\' character or '\(' to get '('
                    bool nextLiteral = false;

                    // () Bracket nesting level. Text appears inside ()
                    int bracketDepth = 0;

                    // Keep previous chars to extract numbers etc.:
                    char[] previousCharacters = new char[_numberOfCharsToKeep];
                    for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';

                    // ... the character-by-character parsing loop is truncated here;
                    // see the PDFParser class in the linked project for the full method.

                    return resultString;
                }
                catch
                {
                    return "";
                }
            }
            #endregion

            #region CheckToken
            /// <summary>
            /// Check if a certain 2 character token just came along (e.g. BT)
            /// </summary>
            /// <param name="tokens">the searched token</param>
            /// <param name="recent">the recent character array</param>
            /// <returns></returns>
            private bool CheckToken(string[] tokens, char[] recent)
            {
                foreach (string token in tokens)
                {
                    if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                        (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                        ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                        ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0a))
                        )
                    {
                        return true;
                    }
                }
                return false;
            }
            #endregion
        }
    }

2.) Parse the extracted text and create an XML file.

  • One of my concerns in the past was PDFs that contain broken links or URLs inside the pages. If that is a concern for you too, regular expressions can solve it fairly easily, but I suggest you deal with it later.

  • Here is some sample code showing how to create an XML file. Make sure you understand how it works so you can adapt it to your own code.

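    try
    {
        string xmlFile = Server.MapPath("DVDlist.xml");
        // Create the XML file (an existing file is overwritten)
        System.Xml.XmlTextWriter writer = new System.Xml.XmlTextWriter(xmlFile, null);
        writer.Formatting = System.Xml.Formatting.Indented;
        // Start a new document
        writer.WriteStartDocument();
        // Write a comment
        writer.WriteComment("Comments: XmlWriter Test Program");
        // An attribute needs an open element; the "DVDlist" and "DVD"
        // element names are assumed here, the original snippet omits them
        writer.WriteStartElement("DVDlist");
        writer.WriteStartElement("DVD");
        writer.WriteAttributeString("ID", "1");
        // Write some simple elements
        writer.WriteElementString("Title", "Tere Naam");
        writer.WriteElementString("Actor", "Salman Khan");
        writer.WriteEndElement(); // closes DVD
        writer.WriteEndElement(); // closes DVDlist
        writer.WriteEndDocument();
        writer.Close();
    }
    catch (Exception e1)
    {
        // Handle or log e1 here
    }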
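
To connect step 1 and step 2, the extracted text can be wrapped in XML once you have it as one string per page. This is only a minimal sketch using LINQ to XML instead of XmlTextWriter; the TextToXml class, the Build method and the "document"/"page" element names are made up for illustration and are not part of the linked project:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TextToXml
{
    // Builds <document><page number="1">...</page>...</document>.
    // Element and attribute names are assumptions chosen for illustration.
    public static XDocument Build(IEnumerable<string> pageTexts)
    {
        var pages = pageTexts.Select((text, index) =>
            new XElement("page",
                new XAttribute("number", index + 1), // pages are numbered from 1
                text));
        return new XDocument(new XElement("document", pages));
    }

    public static void Main()
    {
        var doc = Build(new[] { "First page text", "Second <page> & more" });
        // Reserved characters such as < and & are escaped on serialization.
        Console.WriteLine(doc);
    }
}
```

In practice you would probably split each page's text into more meaningful elements (headings, rows, fields) before writing it out. Text extracted from a PDF may also contain control characters that are not valid in XML, so some cleanup before building the elements may be needed.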

Hope it helps :)

Upvotes: 6


Reputation: 2863

I ended up using ByteScout's PDF Extractor SDK. It works really well.

Upvotes: 0


Reputation: 9431

You can use a PDF library such as iTextSharp to query your PDF file. Once you have the data you need, you can easily create an XML file from it. There is a ton of information on the web about creating XML files with C# and other .NET languages. If you have a specific question, just ask ;-)
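
As a rough sketch of the "query your PDF file" step, assuming iTextSharp 5.x (where PdfTextExtractor lives in the iTextSharp.text.pdf.parser namespace), something like the following reads the text of every page. The PdfPageReader class and ReadPages method are hypothetical names used only for this example:

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfPageReader
{
    // Returns the extracted text of each page, in page order.
    // Assumes iTextSharp 5.x is referenced by the project.
    public static List<string> ReadPages(string pdfPath)
    {
        var pages = new List<string>();
        var reader = new PdfReader(pdfPath);
        try
        {
            // iTextSharp page numbers start at 1
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            reader.Close();
        }
        return pages;
    }
}
```

The resulting list of strings can then be handed to whatever XML-writing code you prefer. How well the text comes out depends heavily on how the PDF was produced, so expect to tune the parsing for your own documents.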

Upvotes: 2
