10101

Reputation: 2402

Parse PDF file to memory and perform search for certain value

I am rather new to C# and am trying to learn it in a more practical way to build interest and understanding. I have code that parses a PDF file (https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf) and it works well. However, I would like to store the extracted text in memory instead of writing it to the console, so that I can search it for the InvoiceNumber later.

My current code, which writes to the console:

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace PDF_file_reader
{
    class Program
    {
        static void Main(string[] args)
        {

            List<int> InvoiceNumbers = new List<int>();

            string filePath = @"C:\temp\parser\Invoice_Template.pdf";
            int pagesToScan = 2;

            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);

                for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
                {
                    ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);

                    strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                    //creating the string array and storing the PDF line by line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        {
                            //Console.WriteLine($"<{line}>");
                            Console.WriteLine(line.ToString());
                        }
                    }

                    Console.Read();
                }

            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}

Here is the output in the console:

[Screenshot of the console output; image not available]

How can I write to the InvoiceNumbers list instead of the console, as I am doing now, and then search it? I guess searching is not possible with my current setup?

Upvotes: 0

Views: 615

Answers (1)

Matt G

Reputation: 36

Just a note: you have an extra set of { } in your foreach loop surrounding Console.WriteLine() that you can remove.

If you want to store the whole invoice number as it is highlighted in your screenshot ("INV-3337" instead of just "3337"), InvoiceNumbers needs to be a list of strings, not ints.

Assuming the invoice layout is always the same, or the number always has the same format (e.g. "Invoice Number INV-####"), you could just add a line in your foreach loop. Since each line is a string, you can check whether the line contains "Invoice Number". If it does, remove the phrase "Invoice Number", trim the result to get rid of any whitespace, and add it to InvoiceNumbers. Either above or below Console.WriteLine(line.ToString()); you would just add:

if (line.Contains("Invoice Number"))
    InvoiceNumbers.Add(line.Replace("Invoice Number", "").Trim());

(I used Replace() instead of Remove() because Remove() requires you to know the start position and length of the phrase you want to remove. In my opinion, Replace() is the safest route for this particular situation.)

If that is all you are looking for, you can also add break; to the if statement. This stops the foreach loop over the current page's lines: once you have extracted the invoice number, there is no reason to look through the rest of the lines, unless you have multiple invoices in one document. Note that break only exits the inner foreach; the outer for loop still moves on to the next page.

if (line.Contains("Invoice Number"))
{
    InvoiceNumbers.Add(line.Replace("Invoice Number", "").Trim());
    break;
}
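
Pulled out of the page loop, the same check can live in a small helper that is easy to try without a PDF. This is only a minimal sketch, assuming the label appears literally as "Invoice Number" on the same line as the value; the names InvoiceParser and ExtractInvoiceNumbers are made up for illustration and are not part of iTextSharp:

```csharp
using System;
using System.Collections.Generic;

public static class InvoiceParser
{
    // Assumption: the label appears literally as "Invoice Number",
    // on the same line as the value that follows it.
    private const string Label = "Invoice Number";

    // Returns every value that follows the label, in the order the lines appear.
    public static List<string> ExtractInvoiceNumbers(IEnumerable<string> lines)
    {
        var invoiceNumbers = new List<string>();
        foreach (string line in lines)
        {
            if (line.Contains(Label))
            {
                // Drop the label and surrounding whitespace, keeping e.g. "INV-3337".
                invoiceNumbers.Add(line.Replace(Label, "").Trim());
            }
        }
        return invoiceNumbers;
    }
}
```

Inside the page loop it could be called as InvoiceNumbers.AddRange(InvoiceParser.ExtractInvoiceNumbers(strText.Split('\n')));, provided InvoiceNumbers is declared as a List<string>.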

If you want to search through the list for a particular invoice number, this answer should help with that.
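
As a rough sketch of what such a search can look like on a List<string>: List<T>.Contains covers an exact match, and List<T>.FindIndex with a predicate covers a case-insensitive one. The class and method names here (InvoiceSearch, HasInvoice, FindInvoiceIndex) are hypothetical:

```csharp
using System;
using System.Collections.Generic;

public static class InvoiceSearch
{
    // Exact, case-sensitive match.
    public static bool HasInvoice(List<string> invoiceNumbers, string wanted)
    {
        return invoiceNumbers.Contains(wanted);
    }

    // Case-insensitive match; returns the index of the first hit, or -1 if there is none.
    public static int FindInvoiceIndex(List<string> invoiceNumbers, string wanted)
    {
        return invoiceNumbers.FindIndex(
            n => string.Equals(n, wanted, StringComparison.OrdinalIgnoreCase));
    }
}
```

For example, InvoiceSearch.HasInvoice(InvoiceNumbers, "INV-3337") tells you whether that invoice number was found in the PDF.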

This assumes that the only thing that differs between invoices is the actual number. If that is not the case, you could look into regular expressions and search for a pattern like "INV-\d+" (INV- followed by one or more digits). That still assumes the invoice number format is always the same.
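
A minimal sketch of the regular-expression route, assuming invoice numbers always look like "INV-" followed by digits. The sketch uses \d+ (one or more digits) so that a bare "INV-" does not count as a match; InvoiceRegex and FindAll are illustrative names only:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InvoiceRegex
{
    // Assumption: invoice numbers look like "INV-" followed by one or more digits.
    private static readonly Regex InvoicePattern = new Regex(@"INV-\d+");

    // Returns every match found in the text, in order of appearance.
    public static List<string> FindAll(string text)
    {
        var matches = new List<string>();
        foreach (Match m in InvoicePattern.Matches(text))
        {
            matches.Add(m.Value);
        }
        return matches;
    }
}
```

Because this works on the whole page text rather than on one line at a time, it could be called directly on strText, and it does not depend on the "Invoice Number" label being present.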

Upvotes: 1
